PhD position: Neural networks for improving speech understanding in real-time embedded systems
14 October 2022
Category: PhD student
Information about the position
- Position: PhD student
- Project: Embedded Artificial Intelligence lab, CEA List
- Location: France > Palaiseau (Paris region)
- Research Field: Artificial Intelligence for audio improvement
- Duration of Contract: 3 years
- Starting date: November 2022
- Contact: Fabrice Auzanneau (email@example.com)
Neural networks for improving speech understanding in real-time embedded systems
Worldwide, around 466 million people currently suffer from hearing loss. To remedy the loss of hearing sensitivity, portable hearing aids have been designed for almost a century. However, fifty percent of hearing-impaired people who need hearing aids do not use them, mainly because of their poor efficiency in complex or noisy environments despite lengthy fitting sessions.
In particular, people suffering from Auditory Neuropathy Spectrum Disorders (ANSDs, 1 to 10% of adults with hearing loss) enjoy little or no benefit from current hearing aids. Unlike regular hearing loss, ANSDs form a continuum of hearing impairments due to synaptic or neuronal dysfunction in the peripheral and central parts of the auditory pathways. ANSDs impair the processing of temporal information without necessarily affecting auditory sensitivity. This can have a particularly dramatic impact in scenarios where the speech of interest is present together with some background noise or with one or several concurrent speakers.
Temporal cues are essential, not only for piecing together speech-related information in different frequency bands, but also for source separation in cocktail-party situations with spatially distributed, competing speech sources. Thus, the main need of ANSD subjects, shared with ageing subjects who experience central auditory-processing difficulties, is not to restore audibility but rather to improve their speech perception by compensating for the deterioration of acoustic cues that rely on temporal precision. Conventional hearing aids mainly rely on sound amplification and focus on spectral enhancement to improve speech analysis, which is not suited to this need.
This PhD fits within the scope of the ANR project "REFINED" involving the Laboratory of Embedded Artificial Intelligence at CEA List in Paris (https://list.cea.fr/), the Multispeech research team at LORIA, Nancy (https://team.inria.fr/multispeech/), and the Hearing Institute in Paris (https://www.institut-audition.fr/).
The project aims at studying new deep-learning-based methods to improve the hearing acuity of ANSD patients. A cohort of ANSD volunteers will be tested to identify spectro-temporal auditory and extra-auditory cues correlated with speech perception. Additionally, the benefits of neural networks will be studied. However, current artificial intelligence methods are too complex to be applied to processors with low computing and memory capacities: compression and optimization methods are needed.
Neural networks for audition: Speech transmission degrades quickly in complex acoustic scenes composed of several sound sources and simultaneous speakers. Since the 1970s, a large body of work has been dedicated to extracting the speech of interest from a complex mixture or to separating each individual source composing the mixture, in part for the benefit of persons with hearing loss. Recent years have seen new approaches involving machine learning (a branch of artificial intelligence), in particular Deep Neural Networks. Current trends include using neural networks to estimate filters or target signals directly [1, 2] or time-frequency masks that are used in turn to compute standard filters [3, 4].
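To make the notion of a time-frequency mask concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of the cited works: the mask is an "oracle" ratio mask computed from known speech and noise magnitudes, whereas in practice a neural network would estimate it from the mixture alone, and the mask is applied directly to the mixture rather than used to derive a filter. The function names `ideal_ratio_mask` and `apply_mask` and the toy values are hypothetical.

```python
def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Per-bin ratio mask in [0, 1]: speech / (speech + noise).

    Inputs are magnitude spectrograms given as lists of frames,
    each frame being a list of non-negative frequency-bin magnitudes.
    """
    return [
        [s / (s + n + eps) for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_mag, noise_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Element-wise product of the mask and the mixture magnitudes."""
    return [
        [m * x for m, x in zip(m_frame, x_frame)]
        for m_frame, x_frame in zip(mask, mixture_mag)
    ]


# Toy example: 2 frames x 2 frequency bins (made-up values).
speech = [[3.0, 0.0], [1.0, 2.0]]
noise = [[1.0, 4.0], [1.0, 2.0]]
# Assumption: magnitudes add up in the mixture (a common simplification).
mixture = [[s + n for s, n in zip(sf, nf)] for sf, nf in zip(speech, noise)]

mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask(mixture, mask)
```

Bins dominated by speech get a mask value close to 1 and are kept, while bins dominated by noise are attenuated towards 0.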
Despite improvements in the performance of speech enhancement and source separation algorithms, the recent trend towards end-to-end approaches (that work directly on the waveform instead of relying on pre-computed time-frequency representations) leads to models that are far too complex to be used in embedded devices like hearing aids. Indeed, the increased complexity of these speech enhancement algorithms has adverse consequences on the real-time performance of the hearing device, often preventing them from being implemented within the device. For instance, the latency that can exist between the movements of the speaker's lips and the sound reproduced by the hearing aid is a major source of discomfort for the user.
Embedded AI: Implementing efficient real-time neural networks on constrained embedded systems with optimal accuracy requires rethinking the design, training, and deployment of models. A large body of literature has addressed these issues, and effort has gone into optimizing deep networks, mainly along two lines:
- Designing more efficient network architectures which maintain reasonable accuracy with a relatively small model size, such as MobileNet and SqueezeNet,
- Reducing the network's size by means of compression or encoding.
Two of the most studied compression solutions are pruning and quantization. Pruning removes redundant parameters or neurons that do not contribute significantly to the accuracy of the results. Quantization refers to techniques for performing calculations and storing tensors at bit widths smaller than single-precision floating-point (32 bits). A quantized model performs some or all of the tensor operations with integers rather than floating-point values. This allows for a more compact representation of the model and the use of high-performance vectorized operations on many hardware platforms.
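As an illustration of these two techniques, the following minimal Python sketch applies magnitude-based pruning followed by symmetric 8-bit quantization to a small list of weights. It makes several assumptions that are not part of the project description: pruning is unstructured and based only on absolute value, quantization uses a single scale per tensor, and the names `magnitude_prune`, `quantize_int8` and `dequantize`, as well as the weight values, are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Set the fraction `sparsity` of smallest-magnitude weights to zero."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by increasing absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One floating-point scale is stored alongside the integer values.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Toy weights (made-up values): half of them are pruned, the rest quantized.
weights = [0.02, -0.75, 0.31, -0.01, 1.27, 0.05, -0.40, 0.003]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
```

Each remaining weight is stored as one signed byte instead of a 32-bit float, at the cost of a rounding error of at most half the scale per weight.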
The goal of this PhD is to study the contribution of new attention-based models (such as transformers) to speech enhancement and to propose a systematic procedure to reduce the complexity of such algorithms sufficiently to allow for their embedding on portable devices while still meeting the quality estimation criterion defined through listening tests with patients. The student will explore approaches such as transfer learning and knowledge distillation in order to reduce the model complexity while preserving the performance. He/she will also explore approaches enforcing sparsity in the models to facilitate effective pruning and perform preliminary experiments based on quantization to reduce overall model complexity. These latter approaches will introduce performance degradation. He/she will analyze the relation between complexity reduction and performance degradation in order to define complexity reduction approaches under constraints on signal/cue estimation performance.
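Knowledge distillation, mentioned above, trains a small "student" model to reproduce the outputs of a larger "teacher" model. The Python sketch below shows one simple output-matching variant for a regression task such as mask or signal estimation. It is an assumption made for illustration, not the procedure defined by the project: the loss is a weighted sum of two mean squared errors, and the names `mse`, `distillation_loss` and the weight `alpha` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Weighted sum of the error w.r.t. the ground truth and the teacher.

    alpha = 1.0 ignores the teacher (plain supervised training);
    alpha = 0.0 trains the student to imitate the teacher only.
    """
    supervised = mse(student_out, target)
    imitation = mse(student_out, teacher_out)
    return alpha * supervised + (1.0 - alpha) * imitation


# Toy outputs for two time-frequency bins (made-up values).
target = [1.0, 0.0]
teacher_out = [0.5, 0.5]
student_out = [0.5, 0.5]
loss = distillation_loss(student_out, teacher_out, target, alpha=0.5)
```

In a real training loop this scalar would be minimized with respect to the student's parameters while the teacher is kept fixed.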
The thesis work aims at exploring new AI-based methods to improve the hearing of ANSD patients. It addresses three main points:
- The use of AI for hearing is a fairly recent topic: this research will study the various types of networks used in the state of the art and explore the relevance of emerging networks such as Transformers, which are based on the attention mechanism, and of transfer learning,
- The application to the special case of ANSD, which is different from enhancement and source separation, will be taken into account,
- In order to respect the constraints of an embedded design, such as low power, low latency, and limited memory and computing capacity, this work will explore the relevance of different neural network compression techniques such as quantization, knowledge distillation or other parameter reduction methods.
The candidate should be an engineer or a university graduate with a very good academic record. He/she should have good knowledge of computer science, electronics and audio signal processing, and experience in artificial intelligence.
The candidate must be proficient in software development (C/C++).
Good oral and written skills in French and English.
About the CEA and LIST
The CEA (French Alternative Energies and Atomic Energy Commission) is a public research institute. It plays an important role in the research, development and innovation community. The CEA has four missions: security and defense, nuclear energy (fission and fusion), technology research for industry and fundamental research. With 16,000 employees, including technicians, engineers, researchers and support personnel, the CEA is involved in numerous research projects in collaboration with both academic and industrial partners.
Within the part of the CEA devoted to technology research for industry, the LIST institute focuses on intelligent digital systems. This institute has a culture of innovation, and its mission is to transfer these technologies to industrial partners. The DSCIN department specializes in complex digital and embedded systems for Artificial Intelligence (AI), High-Performance Computing (HPC) and cybersecurity applications.
The LIAE laboratory focuses on Embedded Artificial Intelligence and on perception combined with multimodal sensors (including vision sensors). The lab is located in the Paris region (Palaiseau).
References
[1] Kolbaek, M. et al. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 25, 1901–1913 (2017).
[2] Luo, Y., Chen, Z., Mesgarani, N. & Yoshioka, T. End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation. arXiv:1910.14104 (2020).
[3] Furnon, N., Serizel, R., Illina, I. & Essid, S. Distributed speech separation in spatially unconstrained microphone arrays. in ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing (2021).
[4] Heymann, J., Drude, L. & Haeb-Umbach, R. Neural network based spectral mask estimation for acoustic beamforming. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 196–200 (2016). doi:10.1109/ICASSP.2016.7471664.
[5] Berlin, C. I. et al. Multi-site diagnosis and management of 260 patients with auditory neuropathy/dys-synchrony (auditory neuropathy spectrum disorder). Int J Audiol 49, 30–43 (2010).
[6] Blalock, D. et al. What is the State of Neural Network Pruning? Proceedings of Machine Learning and Systems 2020 (MLSys 2020). https://arxiv.org/abs/2003.03033
[7] Gholami, A. et al. A Survey of Quantization Methods for Efficient Neural Network Inference. https://arxiv.org/abs/2103.13630