Get acquainted with Google Dataset Search.

Google Dataset Search

Google has dominated web search for nearly two decades, and its acquisition of YouTube gave it the second most popular search engine in the world. Yet it seemingly lost the product search niche to Amazon. It’s not surprising that, amid growing interest in all things data, including public and open data, this tech giant would be keen to build a search product geared towards making dataset search easier. What is surprising is how long it took to develop and release this product, which was officially introduced to the general public on January 23rd, 2020, after spending more than 16 months in beta testing. You can embark on your own dataset search journey here.

Since there is a plethora of options available for your consideration, finding the data set most relevant to your requirements and circumstances can become a time-consuming effort. Enter Google Dataset Search and its list of trusted data providers, which can help you tackle this undertaking with less effort and more results, leaving you with enough energy to work on the actual data analysis rather than on data hunting and cleansing.

Google’s announcement included an updated index, now covering almost 25 million (!) datasets, and improved search capability by data type. We can filter results by table, document, image, text, archive, and other data types. Table seems to be the most logical data type people would search for [luckily, more than a quarter of the datasets found are in this format], and I’m not sure how practical types such as Image would be. Other filter options include time period of last update (1 month/1 year/3 years), usage rights, and a Free license flag:

Google Dataset Search filter options

In its effort to standardize the taxonomy of crawled data, Google partnered with Microsoft, Yahoo, and Russia’s Yandex to found Schema.org, a

“collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.”

This partnership makes it easier for developers to release datasets that these search giants can find, yet it also presents a limitation: the majority of the Web’s datasets are not compliant with this standard for metadata tagging. At first glance, many of the results come from academic sources, possibly due to strong efforts by various educational institutions to publish their metadata and make scientific research more accessible. The biggest topics covered include biology, agriculture, and geosciences, followed by open government data. The US appears to be the leader in this space, with over 2 million open government-related datasets available.
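
To make the compliance point a bit more concrete, here is a minimal sketch of how one might check whether a page carries Schema.org `Dataset` markup in JSON-LD form. The class and function names (`JsonLdExtractor`, `has_dataset_markup`) are my own hypothetical helpers, not anything Google publishes, and the check is deliberately simplistic: it only looks at top-level JSON-LD objects and ignores other encodings such as Microdata or RDFa, as well as nested `@graph` structures.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def has_dataset_markup(html):
    """Return True if any top-level JSON-LD object declares @type "Dataset"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Dataset" in types:
                return True
    return False
```

A page without such a block would simply return `False` here, which is roughly the situation the limitation above describes: the data may exist, but a crawler relying on this markup has nothing to index.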

I was curious to see what kind of eCommerce datasets I could find and, interestingly enough, there were 34 matches in Table format, offered for free and published in the past year. Clicking on a result brings up a snapshot of the data source, including detailed properties describing it:

Search results take you directly to the page hosting this data, but a few of the datasets (especially open government ones) are actually available from different places:

If you’re looking to learn more about making your own dataset discoverable by Google, you can explore their FAQ page. To learn about their motivation for developing this product, head here.
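
As a rough illustration of what "discoverable" means in practice, the sketch below builds a Schema.org `Dataset` description and wraps it in a JSON-LD script tag that could be embedded in the page hosting the data. The property names (`name`, `description`, `url`, `license`, `keywords`) come from the Schema.org `Dataset` type; the helper functions and all example values are hypothetical, and the FAQ page remains the place to check which properties Google actually expects.

```python
import json


def dataset_jsonld(name, description, url, license_url=None, keywords=()):
    """Build a minimal Schema.org Dataset description as a plain dict."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    # Optional properties are only added when supplied.
    if license_url:
        doc["license"] = license_url
    if keywords:
        doc["keywords"] = list(keywords)
    return doc


def as_script_tag(doc):
    """Wrap the description in a JSON-LD script tag for embedding in HTML."""
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = dataset_jsonld(
        name="Sample online retail orders",
        description="Hypothetical order-level data for an online shop.",
        url="https://example.com/datasets/retail-orders",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        keywords=["ecommerce", "orders"],
    )
    print(as_script_tag(example))
```

The printed tag would go into the HTML of the dataset's landing page, so that the description travels with the page itself rather than living in a separate registry.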

Even with 25 million datasets available, this index is simply the tip of the iceberg considering the potential number of datasets “in the wild.” Limitations include a primary focus on science, a heavy over-representation of US-provided datasets, and strict reliance on Schema.org.

Time will tell how successful further development of this product will be, yet I’d think that most of its potential target audience has probably picked their favorites by now. I certainly am partial to other Google-led projects, such as Kaggle data and BigQuery’s public datasets. Here is my attempt to catalog a comprehensive list of public and open dataset resources.

Have you had a chance to use Google Dataset Search yet? Do you already have a preferred dataset provider? Please share your story in the comments section.
