Working with Google Sheets’ QUERY Function
So far we’ve covered some ground when it comes to using spreadsheet functions within MS Excel ecosystem. Perhaps it’s time to turn to one of its leading competitors: Google Sheets ? When GoogleSheets was launched noone, but perhaps its creators would seriously consider it to be a real threat the mighty, ubiquitous Excel spreadsheet package. Over the years this product went through a plethora of incremental, as well as break-through enhancements. By now GoogleSheets achieved an admirable user base backing it’s strong functionality matching and in some instances surpassing the gold standard continuously maintained by the much beloved Excel program.
One of most versatile Sheets’ functions is the QUERY function. Its SQL-like syntax allows us to retrieve specific information from our worksheet, enabling various filtering, sorting, lookup, and aggregation capabilities. While this function is more powerful than INDEX/MATCH, VLOOKUP, SUMPRODUCT, SUMIFS and others, its syntax is rather simple: QUERY(Data, Query, [Headers]) .
Data – range of cells we want to query from
Query – actual query code, to be enclosed in quotation marks or be a reference to a cell containing query text.
Headers – OPTIONAL, number of header rows at the top of data, when omitted value is set to -1.
If you can comfortably compose basic SQL queries, you will find this function rather intuitive and easy to use. However, you might also find yourself wanting to throw your keyboard across the room when some of the SQL functionality we take for granted is either not found or not implemented properly in this function. Working with aliases has definitely topped my personal list of annoyances.
To get started, let’s retrieve some data we want to analyze further. We’ve already talked here about various Big Query sample data sets . For no particular reason I landed on using the Natality table to retrieve a sample of 10,000 records for babies born in Illinois in 1980. Among other things, this table includes parents’ age, race, baby’s weight, sex, date of birth, etc. If you are interested in replicating my results below is the query I used:
SELECT year, month, day, state, weight_pounds, child_race, is_male, plurality, mother_residence_state, mother_age, gestation_weeks, mother_married, mother_birth_state, mother_race, father_race, father_age from [bigquery-public-data:samples.natality] where day is not null and state is not null and mother_residence_state is not null and mother_birth_state is not null and state = 'IL' and year = 1980 limit 10000
After running the above query I decided to create some helper columns to augment this dataset, including using the Date field as a concatenation of year/month/day from the original data, determining Day of Week based on that date, looking up baby’s and parents’ race information from the lookup table and filtering for foreign-born mothers:
I. Making a copy of your data.
First things first, we have options when referring to the data range of interest: we can specify columns we would like to be considered in our query, we can refer to the specific data range, or it might make sense for us to create a named range and reference it in our formula. For the purposes of our workbook, underlying data is found within the NatalityData tab, cells A1 through V10001 specifically, where first row consists of column headings and remaining 10,000 rows contain our data. To give you options on how to use this function, I also created a named range called LKP_Natality to refer to the same data.
Query’s code Select * will do the job whenever we need to retrieve full data set. Remembering different ways to refer to our data range, all three options below will yield exactly the same results:
=QUERY(NatalityData!$A$1:$V$10001,"SELECT * ")
OR
=QUERY(NatalityData!A:V,"SELECT * ")
OR
=QUERY(LKP_Natality,"SELECT * ")
While I will leave it up to you to decide which option to choose, for the purposes of this post I will use the first option: I don’t want to inadvertently select junk data in a random row 123,456 which could potentially happen with the second option; nor do I want to over-complicate things by using named range’s functionality just yet.
II. Fetching data from a specific column/field.
Syntax used for retrieving data from a specific column will definitely be familiar to the SQL users, yet there is a nuance to be considered here; we cannot simply use column headings/field names in our query code, rather having to use column name here. The below code will retrieve all records corresponding to day of week and month of birth from our data set:
=QUERY(NatalityData!$A$1:$V$100001,"SELECT B, H")
III. Retrieving unique list of records.
Perhaps it would be even more useful to get the de-duped version of our data instead of getting all records. For some reason SQL’s trusted DISTINCT keyword would be of no use here, yet there is a way around of this limitation using GoogleSheets’ own UNIQUE function and wrapping QUERY inside of it:
=UNIQUE(QUERY(NatalityData!$A$1:$V$10001,"SELECT B "))
This code will help us get the unique listing of all days of the week present in our data set (all seven of them):
IV. Applying a condition to your query (WHERE clause).
While getting data from specific columns is quite powerful, arguably the most powerful feature of SQL code is the ability to specify custom condition(s) to be applied to our data to better meet our analysis needs. Luckily QUERY function delivers here.
Looking at our data, let’s answer some questions one might ask:
1. What is the percentage of twins in our data set?
Assuming our records are representative of the total, there appears to be no octuplets born in IL in 1980, with almost 2% babies (184 out of 10,000) being twins. In order to write the below query we simply need to understand our data and know that the Plurality column shows number of babies being born to the same mother, while COUNT keyword helps us calculate number of records meeting our criteria, and it can be applied to any field within our data, I chose to use column A for simplicity:
=QUERY(NatalityData!$A$1:$V$10001,"SELECT COUNT(A) WHERE N = 2 ")
2. How many baby girls were there?
Using the IS_MALE flag we need to query for all instances where this flag is set to FALSE value, yielding the result of 4,896 [48.9% baby girls in the mix]:
=QUERY(NatalityData!$A$1:$V$10001,"SELECT COUNT (A) WHERE M = FALSE ")
3. How many mothers were not born in Illinois?
Since we have access to both: state of birth for the mother vs. state where the mother gave the birth, we can deduce that a great chunk of all Illinois babies were born to mothers not born in Illinois totaling rather high 3,754 estimate. Not equal sign <> comes very handy here:
=QUERY(NatalityData!$A$1:$V$10001,"SELECT COUNT(A) WHERE J <> S ")
V. Renaming your query column names/working with aliases.
As mentioned in the beginning of the post one of the most frustrating limitations of the QUERY function for me was the lack of simple aliases – column naming intuitive and omnipresent in all flavors of SQL. What if we needed to revisit the first WHERE clause example and rename somewhat hard to follow count Date to a more straight-forward name. AS keyword is not part of the QUERY function, hence let’s use the closest next thing – LABEL function:
=QUERY(NatalityData!$A$1:$V$10001, "SELECT COUNT(A) WHERE N = 2 LABEL COUNT (A) 'Number of twins' ")
Hopefully you could agree that this version results in a more presentable query outcome:
VI. Grouping your data (GROUP BY clause).
Right next to the WHERE clause in the order of importance I would place the GROUP BY clause allowing us to summarize our data at the desired level of granularity. Suppose we would like to know number of births per mother’s age, then we would group our data at the mother’s age level (unique listing of all ages) and aggregate this data set using COUNT function. Who knew that 11.09% of all babies born in IL in 1980 were born to mothers aged 18 or younger:
=QUERY(NatalityData!$A$1:$V$10001, "SELECT P, COUNT(A) GROUP BY P LABEL P 'Mother Age', COUNT(A) 'Number of births' ")
VII. Sorting your data (ORDER BY clause) .
A great reporting feature is using ORDER BY clause to sort our data set based on column order of interest. While the default sorting option is based on ascending order, we can fully customize our query as far as sorting order is concerned. What if we wanted to sort our data by day of week in descending order of number of babies born? Interestingly enough, Wednesday shows the largest number of births, with Sunday being the slowest day for this action:
=QUERY(NatalityData!$A$1:$V$10001, "SELECT B, COUNT(A) GROUP BY B ORDER BY COUNT(A) DESC LABEL COUNT(A) 'Number of births' ")
VIII. Turning your data into a pivot table (PIVOT BY clause) .
Being first and foremost the spreadsheet-processing software, GoogleSheets excels (no pun intended) when it comes to creating Pivot Table summary of your data, something that not all databases can handle. Perhaps we want to look deeper into the previous example and see if different days of the week show different peaks/valleys by race (don’t ask me for the rationale used in race-based categorization used in this data set.) In fact, according to our data there were some differences between the races when it comes to the most/least popular days for the baby births. In order to proceed with this solution we are looking to organize/group our data by mother’s race, while pivoting it by day of week:
=QUERY(NatalityData!$A$1:$V$10001, "SELECT C, COUNT(A) GROUP BY C PIVOT B ")
To see these and some other examples of QUERY function; as well as actual data source used, feel free to download a copy of my worksheet
How do you currently use QUERY function, do you have any other worksheet functions you would like us to cover here?
Amazing! Its really awesome paragraph, I have got much clear idea on the
topic of from this paragraph.
That’s great… but how about an article that shows us the Excel equivalent? (or at least a workable comparison?)
Thanks for stopping by, Nick – I’d love to use Excel to create an equivalent, but as per my post, it simply does not exist at this point. The closest that we could get would be using PowerQuery for these purposes.