Working with sample datasets in BigQuery
In the previous post we added public tables to our BigQuery interface. However, Google already provides sample data on various topics by default. While most of these tables are not updated, they still present some interest in terms of learning trends or insights on a multitude of topics. We will focus on 3 of these tables:
Natality (daily US births from 1969 to 2008),
GSOD (daily weather information by a station number from 1929 to 2009),
and Shakespeare (word index of Shakespeare’s works.)
Let’s start our exploration with the Natality dataset. The graph above charts share of teenager births, comparing to grand total by year. Between 1969, nominal number of births by teenagers went up from 307,561 to 441,110. However, this is not necessarily a bad news, considering growing US population. While in 1973, almost every fifth birth (19.55%) was by a teenager mother, by 2005 this ratio dropped to every 10th birth (10.18%.) To pull relevant source data, we simply need to run the following query (which would incidentally retrieve preteen births as well [outliers representing fewer than 200 births a year.]):
Getting started with Google BigQuery and GDELT Project
Once upon the time, the new kid on the block left more established search engines in the dust, then, after reinventing web-based email service, Google introduced its Apps. Today, let’s talk about one of the myriad services Google offers to us: BigQuery. Basically, this cloud-based service allows us to utilize Google’s hardware to store our own datasets or access public data on the go. Google provides API for Java, PHP, and Python access. In addition, various third-party tools now connect directly to BigQuery: Tableau, R, JasperSoft, and Simba to name a few. We get a 1 TB monthly usage quota to query BigQuery’s data for free. Some of the downsides of this service include: premiums for storing our own data and querying in excess of the free quota. We are also limited with data manipulation tasks we can perform in BigQuery; in fact, we can only append records to our table, we cannot update or delete them. Finally, this service uses a SQL language dialect, which lacks some of the SQL commands we are accustomed to: DISTINCT comes to mind, or resort us to some convoluted workarounds (try using the TOP command.) Meet, the GDELT Project – “the largest, most comprehensive, and highest resolution open database of human society ever created.” In this tutorial, we will learn some interesting facts about different countries, using GDELT data in BigQuery.
Where are the jobs? 2014 INC 5000 List might provide some answers…
What do Domino’s Pizza, Microsoft, Timberland, Tough Mudder, Intuit, Vermont Teddy Bear, E*Trade, Lending Club, Morningstar, Oracle, Fuhu, Cold Stone Creamery, Under Armour, and GoDaddy have in common? They all have appeared on Inc Magazine’s INC 500 list of the fastest-growing American private companies. Looking at the last year’s INC 5000 list might help us draw employment insights and identify best industries and places to look for a job. By definition, these company are experiencing tremendous growth, and as such, could be representative of future job opportunities in the US. In addition, according to Forbes , small businesses added over 65% of the net new jobs in the past two decades. So, where are the jobs?! Based on INC 5000 data, Chicago, IL employs over 5% [56,813] of all INC 5000 workers. Another 14% [144,847] of INC 5000 employees work in California. Top 3 industries by employment include: Human Resources, Business Products & Services, and IT Services . Together, they are responsible for almost 400,000 jobs, or 38% of the grand total.
Happy New Year! Highlights of 2014.
2015 is quickly approaching, and I wanted to share the brightest highlights of 2014 with my readers. All of the events are listed in chronological order:
#1. First blog post , on Income Tax Rate inequality in US published on August 27th.
#2. First influencer follower on Twitter – @Modeloff . World Financial Modeling Championship followed me on Twitter on October 27th. They subsequently retweeted my solution to the 2013 Modeloff Data Analysis problem.
#3. Official Twitter account of my favorite software application – @MsExcel (if you really have to ask) , followed us on December 10th.
The opportunities are now endless. Who knows, perhaps they will grant my New Year’s wish:
#4. One SlideShare presentation viewed 1,000 times , My presentation on the best AT&T plan for your iPhone6, reached 1,000 views on December 16th. All four of my presentation have accumulated 2,000 views on the last day of 2014…
#5. Like on LinkedIn update from a real-life Excel MVP , Jordan Goldmeier – Excel MVP of @OptionExplicit and Cambia Factor fame liked my LinkedIn update on working with nested if functions in Excel on December 26th.
Have a very happy, healthy, and prosperous 2015!