This month, the laboratory is pleased to hold its fourth semi-annual workshop and steering committee meeting.
9:00 Welcome & coffee

9:25 Introduction

9:30 – 10:30 Emmanouil Benetos, “Machine learning paradigms for music and audio understanding”

Abstract: The area of computational audio analysis (also called machine listening) continues to evolve. Starting from methods grounded in digital signal processing and acoustics, followed by supervised machine learning methods that require large amounts of labelled data, recent approaches for learning audio representations are fueled by advances in the broader field of artificial intelligence. The talk will outline recent research carried out at the Centre for Digital Music of Queen Mary University of London, focusing on emerging learning paradigms for making sense of music and audio data. Topics covered will include learning in the presence of limited audio data, the inclusion of other modalities such as natural language to aid learning audio representations, and finally methods for learning from unlabelled audio data, with the latter being used as a first step towards the creation of music foundation models.

Bio: Emmanouil Benetos is Reader in Machine Listening, Royal Academy of Engineering / Leverhulme Trust Research Fellow, and Turing Fellow at Queen Mary University of London. Within Queen Mary, he is a member of the Centre for Digital Music and the Digital Environment Research Institute, is Deputy Director of the UKRI Centre for Doctoral Training in AI and Music (AIM), and co-leads the School's Machine Listening Lab. His main area of research is computational audio analysis, also referred to as machine listening or computer audition, with applications to music, urban, everyday, and nature sounds.

10:30 – 11:15 ADASP PhD Presentations

  • Antonin Gagneré et al. “Adapting Pitch-Based Self Supervised Learning Models For Tempo Estimation”
  • Manvi Agarwal et al. “Structure-Informed Positional Encoding For Music Generation”
  • Aurian Quelennec et al. “On The Choice Of The Optimal Temporal Support For Audio Classification With Pre-Trained Embeddings”

11:15 Coffee break

11:30 – 12:30 Thomas Pellegrini, “Deep Learning-Based Audio Event Classification and Description”

Abstract: Automatic detection of audio events already existed before the deep learning era; however, the emergence of deep neural networks, coupled with the availability of extensive human-labeled datasets, has opened new perspectives in the field. Recently, natural language has been introduced, particularly in the yearly DCASE challenges, through tasks such as audio captioning (AC) and audio-text retrieval (ATR), which involve describing audio recordings using free text and written sentences. In this talk, I will discuss our experiments addressing these tasks. I will start by describing our efforts to build convolutional neural networks that are competitive with transformers in audio tagging on AudioSet, by adapting the computer vision architecture ConvNeXt. I will introduce ConvNeXt-DCLS, our attempt to build dilated convolution layers that jointly learn network weights and the positions of non-zero elements within convolution kernels. I will present CoNeTTE, our audio captioning system aiming at handling multiple captioning datasets using dataset tokens. Finally, I will reflect on the limitations and challenges facing the field, particularly those associated with the datasets commonly employed by the community.

Bio: Thomas Pellegrini is an Associate Professor of Computer Science at Université Toulouse III Paul Sabatier and a researcher at the Institut de Recherche en Informatique de Toulouse (IRIT). He leads the Master's program "Interactions between computer science and mathematics for AI" (IMA). His research focuses on computational audio analysis, with a particular emphasis on audio event detection and child speech recognition, employing deep learning techniques, especially in scenarios characterized by limited annotated data.
12:30 Lunch break

14:00 Steering committee