Getting started with Google BigQuery and GDELT Project
Once upon the time, the new kid on the block left more established search engines in the dust, then, after reinventing web-based email service, Google introduced its Apps. Today, let’s talk about one of the myriad services Google offers to us: BigQuery. Basically, this cloud-based service allows us to utilize Google’s hardware to store our own datasets or access public data on the go. Google provides API for Java, PHP, and Python access. In addition, various third-party tools now connect directly to BigQuery: Tableau, R, JasperSoft, and Simba to name a few. We get a 1 TB monthly usage quota to query BigQuery’s data for free. Some of the downsides of this service include: premiums for storing our own data and querying in excess of the free quota. We are also limited with data manipulation tasks we can perform in BigQuery; in fact, we can only append records to our table, we cannot update or delete them. Finally, this service uses a SQL language dialect, which lacks some of the SQL commands we are accustomed to: DISTINCT comes to mind, or resort us to some convoluted workarounds (try using the TOP command.) Meet, the GDELT Project – “the largest, most comprehensive, and highest resolution open database of human society ever created.” In this tutorial, we will learn some interesting facts about different countries, using GDELT data in BigQuery.
First things first, to get started with BigQuery, you need to do the following:
1. Navigate to Google Developers Console.
2. Accept terms of service.
3. Create a new project.
4. “Load” GDELT table into your project: GDELT Countryinfo.
5. After familiarizing yourself with this table’s schema (fields and data types) and details (table description, documentation, and sample records), click on the Compose Query button to start querying GDELT’s data in BigQuery.
Let’s suppose we need a list of 5 most populous countries in the World. Since this information is readily available, let’s also gather each country’s capital information, and it’s continent. We’ve already covered basics of the SQL SELECT statement in one of the Model Off posts. Below statement selects four fields of interest (country name, country’s capital, continent, and population), the records will be sorted in descending order, and we will display only 5 (top) records only:
SELECT Country, Capital, Continent, Population FROM [gdelt-bq:extra.countryinfo] ORDER BY Population DESC LIMIT 5;
Word of caution is required now: while factually this information is accurate in terms of displaying top 5 most populous countries, actual population estimates are not exactly precise, since we don’t know the last time when this data was updated. Double-checking these numbers against real-time official population clocks, Brazil’s estimate is really accurate, while India’s population appears grossly under-reported. All three remaining countries: China , USA , and Indonesia are somewhat accurate. As a result, we cannot rely on the total world population estimate according to the GDELT table.
The next logical question is to display 5 largest countries by land mass :
SELECT Country, Capital, Continent, Area FROM [gdelt-bq:extra.countryinfo] ORDER BY Area DESC LIMIT 5
Note that GDELT considers Antarctica to be a country.
To display the World’s least densely populated countries, we can run the query below:
SELECT Country, Capital, Continent, Area/Population Density FROM [gdelt-bq:extra.countryinfo] ORDER BY Density DESC LIMIT 25
Next, let’s retrieve the list of countries, which joined the Euro currency zone:
SELECT Country FROM [gdelt-bq:extra.countryinfo] WHERE Currency_Name = 'Euro'
The result of this query includes multiple listings for France, since it lists its overseas departments and territories as independent countries. At the same time, this list is not updated with Lithuania, which joined the Euro zone this year.
The last query using this table will show us all 25 countries located in Oceania, and their corresponding third level internet domains:
SELECT Country, Tld FROM [gdelt-bq:extra.countryinfo] WHERE Continent = 'OC'
Did you know that Micronesia has .fm domain?! Many popular Web services utilize .cc domain, which belongs to Cocos Islands, with population of 596 , and area of 14 square kilometers.
To test your BigQuery skills, why don’t you write the code needed to determine World’s population by continent and generate records below? Please share your findings in the comments section of this post:
If you want to see the real power behind Google BigQuery and GDELT data, you could watch this inspirational video, where Dr Kalev H. Leetaru (GDELT’s creator) shows how recent events in Ukraine can be analyzed with the data that he gathered:
You might be interested in the query code that Dr. Leetaru used in his presentation:
SELECT MonthYear MonthYear, INTEGER(norm*100000)/1000 Percent FROM ( SELECT ActionGeo_CountryCode, EventRootCode, MonthYear, COUNT(1) AS c, RATIO_TO_REPORT(c) OVER(PARTITION BY MonthYear ORDER BY c DESC) norm FROM [gdelt-bq:full.events] GROUP BY ActionGeo_CountryCode, EventRootCode, MonthYear ) WHERE ActionGeo_CountryCode='UP' and EventRootCode='14' ORDER BY ActionGeo_CountryCode, EventRootCode, MonthYear;