Dataset Search for Data Discovery, Augmentation, and Explanation
Juliana Freire (New York University)
12 November 2024, 12:45-13:45 | Echo-ARENA | Live-stream: https://collegerama.tudelft.nl/Mediasite/Channel/eemcs-cs-distinguished-speaker-lectures-cs-dsl/watch/677db5f2fba84c948db6dcee8835513a1d
Abstract
In recent years, we have witnessed an explosion in our capacity to collect and catalog vast amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should, in theory, allow us to make progress on many of our most important scientific and societal questions.
However, this opportunity is often unrealized due to a central technical barrier: it remains nearly impossible for domain experts to sift through the overwhelming amount of available information to discover the datasets they need for their specific applications. While search engines have addressed the discovery problem for Web documents, supporting the discovery of structured data presents new challenges. These include crawling the Web in search of datasets, indexing datasets and supporting dataset-oriented queries, and creating new techniques to rank and display results.
In this talk, I will discuss these challenges and present our recent work in this area. Specifically, I will describe strategies for finding relevant datasets on the Web and deriving metadata to be indexed. Additionally, I will introduce a new class of data-relationship queries and outline a collection of methods that efficiently support various types of relationships, demonstrating how they can be used for data explanation and augmentation. Finally, I will showcase Auctus, an open-source dataset search engine that we have developed at the NYU Visualization, Imaging, and Data Analysis (VIDA) Center. I will conclude by highlighting open problems and suggesting directions for future research.