Working with sample datasets in BigQuery
In the previous post we added public tables to our BigQuery interface. However, Google already provides sample data on various topics by default. While most of these tables are not updated, they still present some interest in terms of learning trends or insights on a multitude of topics. We will focus on 3 of these tables:
Natality (daily US births from 1969 to 2008),
GSOD (daily weather information by a station number from 1929 to 2009),
and Shakespeare (word index of Shakespeare’s works.)
Let’s start our exploration with the Natality dataset. The graph above charts share of teenager births, comparing to grand total by year. Between 1969, nominal number of births by teenagers went up from 307,561 to 441,110. However, this is not necessarily a bad news, considering growing US population. While in 1973, almost every fifth birth (19.55%) was by a teenager mother, by 2005 this ratio dropped to every 10th birth (10.18%.) To pull relevant source data, we simply need to run the following query (which would incidentally retrieve preteen births as well [outliers representing fewer than 200 births a year.]):
Getting started with Google BigQuery and GDELT Project
Once upon the time, the new kid on the block left more established search engines in the dust, then, after reinventing web-based email service, Google introduced its Apps. Today, let’s talk about one of the myriad services Google offers to us: BigQuery. Basically, this cloud-based service allows us to utilize Google’s hardware to store our own datasets or access public data on the go. Google provides API for Java, PHP, and Python access. In addition, various third-party tools now connect directly to BigQuery: Tableau, R, JasperSoft, and Simba to name a few. We get a 1 TB monthly usage quota to query BigQuery’s data for free. Some of the downsides of this service include: premiums for storing our own data and querying in excess of the free quota. We are also limited with data manipulation tasks we can perform in BigQuery; in fact, we can only append records to our table, we cannot update or delete them. Finally, this service uses a SQL language dialect, which lacks some of the SQL commands we are accustomed to: DISTINCT comes to mind, or resort us to some convoluted workarounds (try using the TOP command.) Meet, the GDELT Project – “the largest, most comprehensive, and highest resolution open database of human society ever created.” In this tutorial, we will learn some interesting facts about different countries, using GDELT data in BigQuery.
Highlights of Gary Cokins Business Analytics seminar
Last month I attened Business Analytics seminar by “an internationally recognized expert, speaker, and author in performance improvement systems and cost management” : Gary Cokins.
My slideshare presentation in the bottom of this post summarizes Gary’s informative and thought-provoiking presentation. In fact, some statements in this post were provoked by this seminar. Gary started by quoting Jeanne X. Harris :“40% of important decisions are not based on facts but rather on intuition, experience, and anecdotal evidence.” Arguably, we could improve the world by basing our decisions on facts, and making better decisions in the process. In addition, better decisions result in better actions, saving us time, effort, money, and other resources. The “actions” part is really important in terms of differentiating Business Analytics from Business Intelligence.
Solving ModelOff Data Analysis problem using Microsoft Access SQL.
Last week we solved ModelOff’s Data Analysis problem from their 2013 championship. Since the second round of 2014 Model Off competition takes place this Saturday, November 8th, let’s pay respect to the data superheroes making it thus far. Our previous ModelOff solution involved using PivotTable feature of Microsoft Excel. Would you believe that we can realistically conceive a solution to the Data analysis problem, using Microsoft Access, or even better, Microsoft’s flavor of the SQL language?
I am as big of an Excel fan as the next guy, much bigger, actually, on the second thought. However, I also believe that when possible, using the right tools for the job will yield better, faster results, than duct-taping your workarounds. So, why Access? Why on Earth, SQL? Well, let’s go to the source: ” If you often have to view your data in a variety of ways, depending on changing conditions or events, Access might be the better choice for storing and working with your data. Access lets you use Structured Query Language (SQL) queries to quickly retrieve just the rows and columns of data that you want, whether the data is contained in one table or many tables. You can also use expressions in queries to create calculated fields.”