Working with sample BigQuery tables

Working with sample datasets in BigQuery

Working with sample datasets in BigQuery

          In the previous post we added public tables to our BigQuery interface. However, Google already provides sample data on various topics by default. While most of these tables are not updated, they still present some interest in terms of learning trends or insights on a multitude of topics. We will focus on 3 of these tables:
Natality (daily US births from 1969 to 2008),
GSOD (daily weather information by a station number from 1929 to 2009),
and Shakespeare (word index of Shakespeare’s works.)

          Let’s start our exploration with the Natality dataset. The graph above charts share of teenager births, comparing to grand total by year. Between 1969, nominal number of births by teenagers went up from 307,561 to 441,110. However, this is not necessarily a bad news, considering growing US population. While in 1973, almost every fifth birth (19.55%) was by a teenager mother, by 2005 this ratio dropped to every 10th birth (10.18%.) To pull relevant source data, we simply need to run the following query (which would incidentally retrieve preteen births as well [outliers representing fewer than 200 births a year.]):

SELECT B.Year, COUNT(A.Year) Births, B.Births BirthsUnder20    
FROM       
publicdata:samples.natality A      
INNER JOIN     
(SELECT year, COUNT (*) Births     
      FROM publicdata:samples.natality     
      WHERE mother_age <20      
      GROUP BY year ) B    
ON A.Year = B.Year     
GROUP BY B.Year, BirthsUnder20 ; 	

          One can argue that this subquery method might be too resource-intensive for what we are trying to achieve. Let’s employ a different methodology to answer the next question: what is the number of twins, triplets, quadruplets (four offspring), and quintuplets (five offspring) born in US between 2004 and 2008? Isn’t it interesting to know that for every quintuplet set born in 2008, there were 3,018 (!) sets of twins born in 2007?! At the same time this ratio was “only” 1,529 in 2007, with 91 quintuplets and 139,209 twins. We can utilize CASE SQL command in order to perform conditional SELECT calculations:

SELECT Year, SUM(CASE WHEN plurality = 2 THEN 1 ELSE 0 END) Twins,
SUM(CASE WHEN plurality = 3 THEN 1 ELSE 0 END) Triplets,
SUM(CASE WHEN plurality = 4 THEN 1 ELSE 0 END ) Quadruplets, 
SUM(CASE WHEN plurality = 5 THEN 1 ELSE 0 END ) Quintuplets
FROM
publicdata:samples.natality 
WHERE YEAR > 2003
GROUP BY 1
ORDER BY 1 ASC ;

          The resulting table becomes:

Number of multiple births per year

          Should you want to see state of birth for 10 heaviest babies on record, you might want to run such as below. Kentucky seems to be the leader in this category, especially in 2004…

SELECT weight_pounds, state, year, month, gestation_weeks
FROM publicdata:samples.natality
WHERE state is not null
ORDER BY weight_pounds DESC
LIMIT 10 ;

          It finally starts to feel like the Spring is coming to Chicagoland. How bad the weather really is in this metro area, home to over 9 million people? Using GSOD data, we can stop guessing and verify our assumptions. First we need to decide on the best place (weather station) to gather this information. Surely the city data will vary greatly from the suburbs numbers, but using station number 725300 Chicago O’Hare Airport seems like a good choice to me.

          Knowing that 2009 is the most recent full year available, let’s first look at the coldest years on record. It’s safe to say that when mean temperature for any given day is below 32 degrees, such a day can referred to as “freezing.” Using BigQuery’s syntax for the TOP command, we can run code below to retrieve five years with highest number of freezing days:

SELECT TOP (year, 5) Year, count(*) FreezingDays
FROM [publicdata:samples.gsod] 
WHERE station_number = 725300 AND wban_number = 94846
      AND mean_temp <32

          Looking at the table below, we learn that O’Hare didn’t have more than 100 freezing days since 1978:

Number of freezing days in Chicago O'Hare

          This is definitely an arbitrary ranking, but I will now attempt to quantify freezing, cold, good, ideal and other days per year. Freezing days are the ones with mean average temperature of under 32 degrees Fahrenheit, assume that Cold days are the ones in the range of 32F to 40F, Good days are the ones in the range of 41F to 59F, Ideal days are the ones in the range of 60F to 75F, and finally, Ideal days are the ones with mean temperature over 75F:

SELECT Year, SUM(CASE WHEN mean_temp <32 THEN 1 ELSE 0 END) FreezingDays,	
SUM(CASE WHEN mean_temp <32 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal1,	
SUM(CASE WHEN mean_temp BETWEEN 32 AND 40 THEN 1 ELSE 0 END) ColdDays,	
SUM(CASE WHEN mean_temp BETWEEN 32 AND 40 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal2,	
SUM(CASE WHEN mean_temp BETWEEN 41 AND 59 THEN 1 ELSE 0 END) GoodDays,	
SUM(CASE WHEN mean_temp BETWEEN 41 AND 59 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal3,	
SUM(CASE WHEN mean_temp BETWEEN 60 AND 75 THEN 1 ELSE 0 END) IdealDays,	
SUM(CASE WHEN mean_temp BETWEEN 60 AND 75 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal4,	
SUM(CASE WHEN mean_temp > 75 THEN 1 ELSE 0 END) HotDays,	
SUM(CASE WHEN mean_temp > 75 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal5,	
COUNT(*) TotalDays	
FROM [publicdata:samples.gsod] 	
WHERE station_number = 725300 AND wban_number = 94846	
      AND year BETWEEN 2000 AND 2009	
GROUP BY 1	
ORDER BY 1 DESC;	

          We had to wait the whole 1.3 seconds to process 3.41 GB (!) worth of data to see the resulting dataset. We can now see that Chicagoland is not as bad as one would think: after all, over 221 days in any given recent year had a mean temperature over 40 degrees Fahrenheit:

Break-own of days by year for Chicago weather

          We have gathered some interesting insight about US birth rates and Chicago weather so far, let’s finish this post with something different: text analytics of Shakespeare’s works. Let’s retrieve 25 most used words in “Romeo And Juliet.” Interestingly enough, love didn’t even make the list, none of the words, with the exception of proper names for the main characters are over 4 characters long, and at that, only 2 words were 4 characters long….

SELECT word, word_count
FROM [publicdata:samples.shakespeare] 
WHERE corpus = 'romeoandjuliet'
ORDER BY word_count DESC
LIMIT 25

          Perhaps you can now write the SQL code to retrieve number of occurrences of word love in each of Shakespeare’s work?!

Word Count - Shakespeare






1 comment

Leave a Reply

Your email address will not be published. Required fields are marked *