This month, the laboratory is pleased to hold its fifth semi-annual workshop and steering committee meeting.

Programme

9:00 Welcome & coffee

9:25 Introduction

9:30 – 10:30 Gaël Richard, “Hybrid and Interpretable deep neural audio processing”

Abstract: We will describe and illustrate a novel avenue for explainable and interpretable deep audio processing based on hybrid deep learning. This paradigm refers here to models that associate data-driven and model-based approaches in a joint framework, integrating our prior knowledge about the data into simple and controllable models. In the speech or music domain, prior knowledge can relate to the production or propagation of sound (using an acoustic or physical model), to the way sound is perceived (based on a perceptual model), or, for music, to how it is composed (using a musicological model). In this presentation, we will first illustrate the concept and potential of such model-based deep learning approaches and then describe in more detail their application to unsupervised singing voice separation, speech dereverberation and symbolic music generation.
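As a rough illustration of the hybrid idea, the sketch below (generic PyTorch, not the speaker's actual models; all module names, sizes and parameter ranges are invented for the example) has a small neural network predict only the interpretable controls, fundamental frequency and harmonic amplitudes, of a fixed harmonic signal model, so the data-driven part is trained end-to-end through a hand-designed, inspectable synthesizer.

```python
# Minimal, illustrative sketch of "hybrid" (model-based) deep audio learning:
# a neural network predicts only the interpretable parameters (f0, harmonic
# amplitudes) of a fixed harmonic signal model, rather than raw waveforms.
# This is a generic toy example, not any system presented at the workshop.

import torch
import torch.nn as nn


class HarmonicController(nn.Module):
    """Data-driven part: maps an input feature frame to signal-model controls."""

    def __init__(self, feat_dim=64, n_harmonics=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1 + n_harmonics),  # 1 output for f0, rest for harmonic gains
        )

    def forward(self, feats):
        out = self.net(feats)
        f0 = 50.0 + 400.0 * torch.sigmoid(out[..., :1])  # fundamental frequency in Hz
        amps = torch.softmax(out[..., 1:], dim=-1)       # normalized harmonic amplitudes
        return f0, amps


def harmonic_synth(f0, amps, n_samples=1024, sr=16000):
    """Model-based part: a fixed, interpretable harmonic oscillator bank."""
    t = torch.arange(n_samples) / sr                      # time axis, shape (n_samples,)
    k = torch.arange(1, amps.shape[-1] + 1)               # harmonic numbers 1..K
    phases = 2 * torch.pi * f0 * k * t.unsqueeze(-1)      # (n_samples, n_harmonics)
    return (torch.sin(phases) * amps).sum(-1)             # (n_samples,)


# Toy usage: the controller is trainable end-to-end through the synthesizer,
# while f0 and the harmonic amplitudes remain directly inspectable.
controller = HarmonicController()
feats = torch.randn(1, 64)
f0, amps = controller(feats)
audio = harmonic_synth(f0.squeeze(0), amps.squeeze(0))
print(audio.shape, float(f0))
```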

Bio: Gaël Richard received the State Engineering degree from Telecom Paris, France, in 1990, and the Ph.D. degree and Habilitation from the University of Paris-Saclay in 1994 and 2001, respectively. After the Ph.D., he spent two years at Rutgers University, Piscataway, NJ, in the Speech Processing Group of Prof. J. Flanagan, where he explored innovative approaches for speech production. From 1997 to 2001, he successively worked for Matra, Bois d’Arcy, France, and for Philips, Montrouge, France. He then joined Telecom Paris, where he is now a Full Professor in audio signal processing. He is also the scientific co-director of the Hi! PARIS interdisciplinary center on AI and Data Analytics for Society. He is a co-author of over 250 papers and an inventor on 10 patents. His research interests include signal representations, source separation, machine learning methods for audio/music signals, and music information retrieval. In 2020, he received the Grand Prize of the IMT-National Academy of Sciences for his research contributions in science and technology. He is an IEEE Fellow and the past Chair of the IEEE SPS Technical Committee for Audio and Acoustic Signal Processing. In 2022, he was awarded an Advanced ERC grant from the European Union for the project “HI-Audio: Hybrid and Interpretable deep neural audio machines”.

10:30 – 11:15 ADASP PhD Presentations

  • Victor Letzelter, David Perera, “Multiple choice learning for audio scene analysis”
  • Morgan Buisson, “Music Structure Analysis with Edge-Conditioned Graph Attention Networks”
  • Xuanyu Zhuang, “Episodic fine-tuning prototypical networks for optimization-based few-shot learning: application to audio classification”

11:15 Coffee break

11:30 – 12:30 Neil Zeghidour, “Audio Language Models”

Abstract: Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterparts. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio comprehension as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, and even text-to-music generation. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
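For readers curious about the recipe sketched in this abstract, the toy example below (generic PyTorch; the uniform quantizer merely stands in for a real neural audio codec, and every hyperparameter is illustrative rather than taken from any system in the talk) shows the two steps: discretize audio into tokens, then model the token sequence with a small causal, decoder-only Transformer trained by next-token prediction.

```python
# Schematic sketch of the audio-language-model recipe:
# (1) discretize audio into tokens with a codec, (2) model the token sequence
# autoregressively with a Transformer. The uniform quantizer is only a stand-in
# for a real neural codec; all sizes here are illustrative.

import torch
import torch.nn as nn

VOCAB = 256  # codebook size of the (toy) codec


def toy_codec_encode(wave):
    """Stand-in for a neural codec: map samples in [-1, 1] to discrete tokens."""
    return ((wave.clamp(-1, 1) + 1) / 2 * (VOCAB - 1)).long()


class TinyAudioLM(nn.Module):
    """Decoder-only Transformer that predicts the next audio token."""

    def __init__(self, d_model=128, n_layers=2, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        T = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask so each position only attends to past tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (B, T, VOCAB) next-token logits


# Toy usage: teacher-forced next-token prediction on a random "waveform".
wave = torch.rand(1, 64) * 2 - 1
tokens = toy_codec_encode(wave)
logits = TinyAudioLM()(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(loss.item())
```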

Bio: Neil is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. He was previously at Google DeepMind, where he started and led a team working on generative audio, with contributions including Google’s first text-to-music API, a voice-preserving speech-to-speech translation system, and the first neural audio codec to outperform general-purpose audio codecs. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from École Normale Supérieure (Paris), and holds an MSc in machine learning from École Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).


12:30 Lunch break

14:00 Steering committee