As humans, we constantly rely on the sounds around us to get information about our environment (birds singing, a car passing, the constant hum from a nearby highway…) and to get feedback about our actions (the noise of a door closing, the beeps from an ATM keyboard…). Ambient sound analysis aims at designing algorithms that can analyze and interpret sounds as a human would. In particular, in sound event detection the goal is to detect not only the class of the sound events but also their time boundaries. Ideally, this would rely on strongly labeled audio clips (with onset and offset timestamps) at training time, but these are time-consuming to obtain and prone to annotation errors. One alternative is to use weakly labeled audio clips (which are cheap to annotate but do not provide timestamps), synthetically generated soundscapes with strong labels (which are cheap to obtain but can introduce a domain mismatch), or a combination of both. Since 2016, the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge has proposed different tasks in ambient sound analysis, ranging from acoustic scene classification, source localization, and anomaly detection to automated audio captioning. Since 2018, we have been organizing a task on sound event detection with systems trained on a heterogeneous dataset composed of both recorded and synthetic soundscapes with varying levels of annotation. After each edition, we have analyzed the performance of the submissions in detail in order to identify the remaining challenges in sound event detection, to evaluate the relevance of the existing metrics, and to adapt the task accordingly. During this talk, I will present the lessons we learned on sound event detection during the past 4 years and the prospective challenges.
Romain Serizel received the Ph.D. degree in Engineering Sciences from KU Leuven (Leuven, Belgium) in June 2011, working on multichannel noise reduction algorithms targeting applications for hearing-impaired persons. He was then a postdoctoral researcher at KU Leuven (2011-2012), at FBK (Trento, Italy, 2013-2014), and at Télécom ParisTech (Paris, France, 2014-2016), where he worked on machine learning applied to speech processing and sound analysis. He is now an Associate Professor with Université de Lorraine (Nancy, France), doing research on machine listening and robust speech communications. He has co-authored 14 journal papers, about 40 articles in peer-reviewed conferences, and 2 book chapters. Since 2018, he has been the coordinator of DCASE challenge task 4 on “Sound event detection in domestic environments”. Since 2019, he has been coordinating the DCASE challenge series together with Annamaria Mesaros.