Getting started with Google BigQuery and GDELT Project

Getting started with Google BigQuery and GDELT

          Once upon a time, the new kid on the block left more established search engines in the dust. Then, after reinventing web-based email, Google introduced its Apps. Today, let’s talk about one of the myriad services Google offers: BigQuery. Basically, this cloud-based service lets us use Google’s hardware to store our own datasets or access public data on the go. Google provides APIs for Java, PHP, and Python access. In addition, various third-party tools now connect directly to BigQuery: Tableau, R, JasperSoft, and Simba, to name a few. We get a 1 TB monthly quota to query BigQuery’s data for free. The downsides of this service include premiums for storing our own data and for querying in excess of the free quota. We are also limited in the data manipulation tasks we can perform in BigQuery; in fact, we can only append records to a table, not update or delete them. Finally, this service uses a SQL dialect that lacks some of the commands we are accustomed to (DISTINCT comes to mind) or forces us to resort to convoluted workarounds (try using the TOP command). Meet the GDELT Project: “the largest, most comprehensive, and highest resolution open database of human society ever created.” In this tutorial, we will learn some interesting facts about different countries, using GDELT data in BigQuery.
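          To give a feel for the kind of workaround mentioned for DISTINCT and TOP, here is a minimal sketch. It assumes the usual substitutes, GROUP BY for deduplication and ORDER BY with LIMIT for a top-n list, and it runs against an in-memory SQLite database as a stand-in rather than BigQuery itself. The events table, its columns, and the numbers are made up for illustration and are not real GDELT data.

```python
import sqlite3

# Hypothetical stand-in data: in-memory SQLite, not BigQuery or real GDELT records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, mentions INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("US", 120), ("FR", 45), ("US", 30), ("BR", 80), ("FR", 10)],
)

# Instead of SELECT DISTINCT country: group by the column to collapse duplicates.
distinct_countries = [
    row[0]
    for row in conn.execute(
        "SELECT country FROM events GROUP BY country ORDER BY country"
    )
]

# Instead of SELECT TOP 2 ...: sort by the aggregate, then cap the rows with LIMIT.
top_two = conn.execute(
    "SELECT country, SUM(mentions) AS total "
    "FROM events GROUP BY country ORDER BY total DESC LIMIT 2"
).fetchall()

print(distinct_countries)
print(top_two)
```

          The same two query shapes should carry over to other SQL dialects that support GROUP BY and LIMIT, though the exact syntax BigQuery accepts is best checked against its own query reference.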

Continue reading

Where are the jobs? 2014 INC 5000 List might provide some answers…

2014 INC 5000 data viz

          What do Domino’s Pizza, Microsoft, Timberland, Tough Mudder, Intuit, Vermont Teddy Bear, E*Trade, Lending Club, Morningstar, Oracle, Fuhu, Cold Stone Creamery, Under Armour, and GoDaddy have in common? They have all appeared on Inc Magazine’s INC 500 list of the fastest-growing American private companies. Looking at last year’s INC 5000 list might help us draw employment insights and identify the best industries and places to look for a job. By definition, these companies are experiencing tremendous growth and, as such, could be representative of future job opportunities in the US. In addition, according to Forbes, small businesses added over 65% of the net new jobs in the past two decades. So, where are the jobs?! Based on INC 5000 data, Chicago, IL employs over 5% [56,813] of all INC 5000 workers. Another 14% [144,847] of INC 5000 employees work in California. The top 3 industries by employment are Human Resources, Business Products & Services, and IT Services. Together, they are responsible for almost 400,000 jobs, or 38% of the grand total.

Continue reading






Solving ModelOff Data Analysis problem using Microsoft Access SQL.

MS Access Solution for ModelOff 2013 Data Analysis problem

         Last week we solved ModelOff’s Data Analysis problem from their 2013 championship. Since the second round of the 2014 ModelOff competition takes place this Saturday, November 8th, let’s pay respect to the data superheroes who have made it this far. Our previous ModelOff solution used the PivotTable feature of Microsoft Excel. Would you believe that we can realistically devise a solution to the Data Analysis problem using Microsoft Access or, even better, Microsoft’s flavor of the SQL language?

         I am as big an Excel fan as the next guy; much bigger, actually, on second thought. However, I also believe that, when possible, using the right tool for the job will yield better, faster results than duct-taping workarounds together. So, why Access? Why on Earth SQL? Well, let’s go to the source: “If you often have to view your data in a variety of ways, depending on changing conditions or events, Access might be the better choice for storing and working with your data. Access lets you use Structured Query Language (SQL) queries to quickly retrieve just the rows and columns of data that you want, whether the data is contained in one table or many tables. You can also use expressions in queries to create calculated fields.”

Continue reading

Solving ModelOff Data Analysis problem from the second round of 2013 competition.

Solution for ModelOff 2013 Data Analysis problem


         As some of you might know, the ModelOff 2014 championship kicked off last Saturday, October 25th. With the first round of this modeling competition behind us, I thought this might be a good time to tackle one of their older problems. I might be biased here, but starting with their Data Analysis problem made perfect sense to me. Luckily enough, this could arguably be one of the easiest problems they have ever presented. You can download both the Questions/Answers worksheet and the Excel workbook with source data for this problem directly from their website.

         As usual, I will not pretend to have the best possible, optimal solution; I simply offer one that works. Since in the real world we are not constrained by the challenging time restrictions imposed on ModelOff contestants, I was not concerned with finding the most efficient solution, but rather the one that is more presentable. As an example, my solution relies heavily on the Top Ten filtering feature available for PivotTables. This might seem like an extra step compared to simply presenting all records and then sorting them in order of preference. In my humble opinion, it is a better way to present the top 1/3/5/n results than looking through extra records of data.

Continue reading