Working with sample datasets in BigQuery
In the previous post we added public tables to our BigQuery interface. However, Google already provides sample data on various topics by default. While most of these tables are not updated, they still present some interest in terms of learning trends or insights on a multitude of topics. We will focus on 3 of these tables:
Natality (daily US births from 1969 to 2008),
GSOD (daily weather information by a station number from 1929 to 2009),
and Shakespeare (word index of Shakespeare’s works.)
Let’s start our exploration with the Natality dataset. The graph above charts share of teenager births, comparing to grand total by year. Between 1969, nominal number of births by teenagers went up from 307,561 to 441,110. However, this is not necessarily a bad news, considering growing US population. While in 1973, almost every fifth birth (19.55%) was by a teenager mother, by 2005 this ratio dropped to every 10th birth (10.18%.) To pull relevant source data, we simply need to run the following query (which would incidentally retrieve preteen births as well [outliers representing fewer than 200 births a year.]):
SELECT B.Year, COUNT(A.Year) Births, B.Births BirthsUnder20 FROM publicdata:samples.natality A INNER JOIN (SELECT year, COUNT (*) Births FROM publicdata:samples.natality WHERE mother_age <20 GROUP BY year ) B ON A.Year = B.Year GROUP BY B.Year, BirthsUnder20 ;
One can argue that this subquery method might be too resource-intensive for what we are trying to achieve. Let’s employ a different methodology to answer the next question: what is the number of twins, triplets, quadruplets (four offspring), and quintuplets (five offspring) born in US between 2004 and 2008? Isn’t it interesting to know that for every quintuplet set born in 2008, there were 3,018 (!) sets of twins born in 2007?! At the same time this ratio was “only” 1,529 in 2007, with 91 quintuplets and 139,209 twins. We can utilize CASE SQL command in order to perform conditional SELECT calculations:
SELECT Year, SUM(CASE WHEN plurality = 2 THEN 1 ELSE 0 END) Twins, SUM(CASE WHEN plurality = 3 THEN 1 ELSE 0 END) Triplets, SUM(CASE WHEN plurality = 4 THEN 1 ELSE 0 END ) Quadruplets, SUM(CASE WHEN plurality = 5 THEN 1 ELSE 0 END ) Quintuplets FROM publicdata:samples.natality WHERE YEAR > 2003 GROUP BY 1 ORDER BY 1 ASC ;
The resulting table becomes:
Should you want to see state of birth for 10 heaviest babies on record, you might want to run such as below. Kentucky seems to be the leader in this category, especially in 2004…
SELECT weight_pounds, state, year, month, gestation_weeks FROM publicdata:samples.natality WHERE state is not null ORDER BY weight_pounds DESC LIMIT 10 ;
It finally starts to feel like the Spring is coming to Chicagoland. How bad the weather really is in this metro area, home to over 9 million people? Using GSOD data, we can stop guessing and verify our assumptions. First we need to decide on the best place (weather station) to gather this information. Surely the city data will vary greatly from the suburbs numbers, but using station number 725300 – Chicago O’Hare Airport seems like a good choice to me.
Knowing that 2009 is the most recent full year available, let’s first look at the coldest years on record. It’s safe to say that when mean temperature for any given day is below 32 degrees, such a day can referred to as “freezing.” Using BigQuery’s syntax for the TOP command, we can run code below to retrieve five years with highest number of freezing days:
SELECT TOP (year, 5) Year, count(*) FreezingDays FROM [publicdata:samples.gsod] WHERE station_number = 725300 AND wban_number = 94846 AND mean_temp <32
Looking at the table below, we learn that O’Hare didn’t have more than 100 freezing days since 1978:
This is definitely an arbitrary ranking, but I will now attempt to quantify freezing, cold, good, ideal and other days per year. Freezing days are the ones with mean average temperature of under 32 degrees Fahrenheit, assume that Cold days are the ones in the range of 32F to 40F, Good days are the ones in the range of 41F to 59F, Ideal days are the ones in the range of 60F to 75F, and finally, Ideal days are the ones with mean temperature over 75F:
SELECT Year, SUM(CASE WHEN mean_temp <32 THEN 1 ELSE 0 END) FreezingDays, SUM(CASE WHEN mean_temp <32 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal1, SUM(CASE WHEN mean_temp BETWEEN 32 AND 40 THEN 1 ELSE 0 END) ColdDays, SUM(CASE WHEN mean_temp BETWEEN 32 AND 40 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal2, SUM(CASE WHEN mean_temp BETWEEN 41 AND 59 THEN 1 ELSE 0 END) GoodDays, SUM(CASE WHEN mean_temp BETWEEN 41 AND 59 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal3, SUM(CASE WHEN mean_temp BETWEEN 60 AND 75 THEN 1 ELSE 0 END) IdealDays, SUM(CASE WHEN mean_temp BETWEEN 60 AND 75 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal4, SUM(CASE WHEN mean_temp > 75 THEN 1 ELSE 0 END) HotDays, SUM(CASE WHEN mean_temp > 75 THEN 1 ELSE 0 END)/COUNT(*)*100 PercentOfTotal5, COUNT(*) TotalDays FROM [publicdata:samples.gsod] WHERE station_number = 725300 AND wban_number = 94846 AND year BETWEEN 2000 AND 2009 GROUP BY 1 ORDER BY 1 DESC;
We had to wait the whole 1.3 seconds to process 3.41 GB (!) worth of data to see the resulting dataset. We can now see that Chicagoland is not as bad as one would think: after all, over 221 days in any given recent year had a mean temperature over 40 degrees Fahrenheit:
We have gathered some interesting insight about US birth rates and Chicago weather so far, let’s finish this post with something different: text analytics of Shakespeare’s works. Let’s retrieve 25 most used words in “Romeo And Juliet.” Interestingly enough, love didn’t even make the list, none of the words, with the exception of proper names for the main characters are over 4 characters long, and at that, only 2 words were 4 characters long….
SELECT word, word_count FROM [publicdata:samples.shakespeare] WHERE corpus = 'romeoandjuliet' ORDER BY word_count DESC LIMIT 25
Perhaps you can now write the SQL code to retrieve number of occurrences of word love in each of Shakespeare’s work?!