Building Resilient Data Pipelines
Gianluca Demartini (University of Queensland)
20 June 2023, 14:00-15:30 | ECHO-ARENA | https://collegerama.tudelft.nl/Mediasite/Channel/eemcs-cs-distinguished-speaker-lectures-cs-dsl/watch/b4dd6ed0bee547d39436150a6a1e4a871d
Abstract
When the goal is to build robust data pipelines, the quality of the data we use is key. While big data often provides sufficient information for large models to learn from, issues of unbalanced, incomplete, or incorrect data may lead to critical errors.
In this talk, I will discuss recent research by my group on human annotator behaviour and its implications for the quality of, and bias in, the collected data. I will first discuss how human bias is reflected in data collected through crowdsourcing, and the consequences of unbalanced data for machine learning models. Then, I will present our work using fine-grained behavioural logs and eye-tracking to better model data curators and human annotators. Finally, I will present examples of biased labels and their impact on ML classification decisions.