Direct coarticulation modelling from real-time midsagittal MRI images of the vocal tract
Application deadline June 5th, 2019 (Midnight Paris time)
MultiSpeech, INRIA Nancy Grand-Est
Yves Laprie (Yves.Laprie@loria.fr)
Our long-term objective is to achieve articulatory synthesis of speech, i.e. the generation of the acoustic signal by simulating its production by a human speaker.
In order to keep this problem tractable, we do not consider the bio-mechanical phenomena involved in the movement of the speech articulators (jaw, tongue, lips, soft palate, larynx and epiglottis). Indeed, the number of muscles involved, their complex organization, the lack of maturity of numerical models applied to muscles and the lack of data place numerical simulations too far from real speech.
We thus only consider the temporal geometry of the vocal tract, the aero-acoustic phenomena, and the vocal fold activity. The advantage is that there exist minimally invasive measuring devices that allow access to the shape of the vocal tract (Magnetic Resonance Imaging) and the activity of the vocal folds (ElectroPhotoGlottoGraphy).
The vocal tract shape, and especially its temporal evolution, has to be modeled so as to provide the numerical acoustic simulations with the relevant geometry at each time point of the synthesis. The shape changes according to the positions of the speech articulators over time. The articulators move continuously, and the speaker must anticipate the positions to be reached in order to produce the desired sounds.
A speech sound is thus not produced independently of the surrounding sounds. Coarticulation covers the influence of the surrounding sounds on the sound currently being articulated. It should be noted that an articulator that is not critical for the production of a sound, i.e. has no acoustic impact on it, can anticipate its position for the coming sounds. For instance, during the production of /ipu/ the tongue is not recruited for the production of /p/ and can thus anticipate the position required by /u/ well before the acoustic onset of the vowel.
The quantitative prediction of coarticulation effects is a challenging task. One of the first numerical models was proposed by Öhman [1]; it consists of superimposing the effect of the consonants onto the trajectories followed by the articulators between two consecutive vowels. Despite its simplicity, this model is still used for its ease of implementation and its relatively good results.
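Öhman's superposition idea can be sketched numerically for a single articulatory parameter: a slow vowel-to-vowel transition is perturbed toward the consonant's target in proportion to a time-varying coarticulation weight. The sketch below is a minimal illustration; the cosine transition, the Gaussian-shaped weight and all parameter values are our own illustrative choices, not taken from the original paper.

```python
import numpy as np

def ohman_trajectory(v1, v2, c_target, w_max, t):
    """Öhman-style superposition for one articulatory parameter.

    v1, v2   : parameter values of the flanking vowels
    c_target : the consonant's target for this parameter
    w_max    : peak coarticulation weight of the consonant (0..1)
    t        : time axis normalised to [0, 1]
    """
    # Slow vowel-to-vowel transition (cosine interpolation).
    vowel = v1 + (v2 - v1) * (1 - np.cos(np.pi * t)) / 2
    # Bell-shaped weight: the consonantal gesture peaks mid-utterance.
    w = w_max * np.exp(-((t - 0.5) ** 2) / (2 * 0.1 ** 2))
    # The consonant pulls the trajectory toward its own target.
    return vowel + w * (c_target - vowel)

t = np.linspace(0.0, 1.0, 101)
x = ohman_trajectory(v1=0.0, v2=1.0, c_target=-0.5, w_max=0.9, t=t)
```

At the utterance edges the trajectory coincides with the vowel targets, while mid-utterance it is deflected toward the consonantal target, which is exactly the superposition behaviour the model describes.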
The overlapping of coordinated gestures corresponding to critical articulatory variables (for example the glottal aperture, labial protrusion and aperture, the place and degree of constriction of the tongue tip or body…) is a key element of articulatory phonology. Attempts to calculate gestures from speech and articulatory data [2] have always rested on simplifying assumptions so strong that they severely limit the scope of the results.
The approach proposed by Cohen and Massaro [3] relies on the idea of finding the influence domain and the coarticulatory effects of each phoneme. These two sets of parameters are trained from a corpus for each phoneme and each articulatory parameter. The main weakness of learning coarticulatory effects independently for each articulator is that nothing enforces the overall consistency across articulators that is required to reach the correct acoustic target.
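The Cohen–Massaro scheme is commonly formalised with dominance functions: each phoneme carries a target value for each articulatory parameter, together with a negative-exponential dominance that decays away from the phoneme centre, and the trajectory is the dominance-weighted average of the targets. A minimal sketch, with illustrative targets and dominance parameters of our own choosing:

```python
import numpy as np

def dominance(t, center, alpha, theta, c):
    """Negative-exponential dominance of one phoneme segment."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def cohen_massaro(t, centers, targets, alpha, theta, c=1.0):
    """Trajectory of one articulatory parameter as the dominance-weighted
    average of the per-phoneme targets (Cohen & Massaro style)."""
    d = np.stack([dominance(t, m, alpha, theta, c) for m in centers])
    return (d * np.asarray(targets)[:, None]).sum(axis=0) / d.sum(axis=0)

t = np.linspace(0.0, 3.0, 301)
# Three phonemes centred at 0.5, 1.5 and 2.5 s, with targets 0, 1 and 0.2.
traj = cohen_massaro(t, centers=[0.5, 1.5, 2.5],
                     targets=[0.0, 1.0, 0.2], alpha=1.0, theta=4.0)
```

Because the weights are positive and normalised, the trajectory is a convex combination of the targets: near each phoneme centre it approaches that phoneme's target, and between centres it blends neighbouring targets, which is where the coarticulatory smoothing comes from.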
Acoustic-to-articulatory inversion for recovering the geometrical positions of a small set of flesh points from the acoustic signal [4] also incorporates some non-explicit coarticulation modeling. Deep learning methods, which require large corpora of ElectroMagnetic Articulography (EMA) data associating the positions of sensors glued onto the articulators with the speech signal for training, are now widely used to tackle this problem. However, only “easily accessible” articulators are considered, because sensors have to be glued onto them. The vocal tract is therefore not taken into account in its entirety, and in addition these approaches are unable to involve a true aero-acoustic dimension.
The objective of this work is to train a coarticulation model that covers all the articulators and guarantees that the target sounds can be generated.
Since this year, the IADI laboratory, with which we have been collaborating for many years, has been equipped with a real-time MRI acquisition system (50 frames per second) that makes it possible to monitor the evolution of the midsagittal shape of the vocal tract during speech production.
This represents a considerable asset in the perspective of studying and modeling coarticulation for several speakers.
The work proposed consists of exploiting these data and is organized in two stages.
The first will consist of tracking articulators in the MRI data. Unlike several approaches which process the complete vocal tract as a single object, we want to track each articulator independently: even if the movements of the articulators are coordinated, they are not necessarily synchronized. Connecting all the articulator contours into one general contour from the glottis to the lips therefore prevents coarticulation from being studied at the level of individual articulators. We have already drawn articulatory contours in about a thousand images, and the preliminary tests we carried out show that this enables fairly good results for the tongue. The objective is to implement a deep-learning auto-encoding approach which, in a first step, learns the image and the associated contour for the images with outlined contours and, in a second step, retrains the first hidden layers without the contours, so as to enable the reconstruction of the contours, and thus tracking, without prior knowledge of them [4,5,6].
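The two-stage idea can be illustrated on toy data. The sketch below replaces the deep network by a linear stand-in, so the stage-1 auto-encoder optimum can be written in closed form (an SVD) rather than trained by backpropagation; the data, dimensions and variable names are all illustrative, not the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 4-pixel "images" and 4-point "contours" that depend
# linearly on them (real inputs would be rtMRI frames and hand-outlined
# articulator contours).
n, d_img, d_hid = 500, 4, 4
img = rng.normal(size=(n, d_img))
cnt = img @ (0.5 * rng.normal(size=(d_img, d_img)))
joint = np.hstack([img, cnt])          # stage-1 input: image + contour

# Stage 1: auto-encode the joint [image, contour] vector through a
# bottleneck. For a linear network the optimal decoder spans the top
# principal directions, obtained here directly from the SVD.
_, _, Vt = np.linalg.svd(joint, full_matrices=False)
W_dec = Vt[:d_hid]                     # decoder: hidden -> [image, contour]

# Stage 2: keep the decoder and retrain an encoder that sees the image
# ONLY, so that decoding its code reconstructs the full joint vector.
B, *_ = np.linalg.lstsq(img, joint, rcond=None)
W_enc2 = B @ W_dec.T                   # encoder: image -> hidden

# Tracking: predict a contour from an image without its annotation.
recon = img @ W_enc2 @ W_dec
cnt_pred = recon[:, d_img:]
err = np.sqrt(np.mean((cnt_pred - cnt) ** 2))
```

In the real setting both stages would of course be deep networks trained by gradient descent, stage 1 on the annotated frames and stage 2 on unannotated frames; the linear version only makes the division of labour between the two stages explicit.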
The second step will be devoted to the modeling of coarticulation via deep learning techniques by identifying the role of each articulator in order to integrate the phenomena of acoustic compensation between articulators.
[1] S.E.G. Öhman. Numerical model of coarticulation. J. Acoust. Soc. Am., 41:310–320, 1967.
[2] H. Nam, V. Mitra, M. Hasegawa-Johnson, C. Espy-Wilson, E. Saltzman, and L. Goldstein. A procedure for estimating gestural scores from speech acoustics. J. Acoust. Soc. Am., 132(6):3980–3989, 2012.
[3] M.M. Cohen and D.W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, Springer, 1993.
[4] B. Uria, S. Renals, and K. Richmond. A deep neural network for acoustic-articulatory speech inversion. Proc. NIPS, 2011.
[5] A. Jaumard-Hakoun, K. Xu, P. Roussel, G. Dreyfus, M. Stone, and B. Denby. Tongue contour extraction from ultrasound images based on deep neural network. Proc. of the International Congress of Phonetic Sciences, Glasgow, 2015.
[6] I. Fasel and J. Berry. Deep Belief Networks for real-time extraction of tongue contours from ultrasound during speech. Proc. of the 20th ICPR, Istanbul, 2010.
[7] G. Litjens, T. Kooi et al. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
Upload your file on jobs.inria.fr in a single pdf or zip file, and send it as well by email to Yves.Laprie@loria.fr. Your file should contain the following documents:
CV including a description of your research activities (2 pages max) and a short description of what you consider to be your best contributions and why (1 page max and 3 contributions max); the contributions may be theoretical or practical. Web links to the contributions should be provided. Also include a brief description of your scientific and career projects, and your scientific positioning with respect to the proposed subject.
The report(s) from your PhD external reviewer(s), if applicable.
If you haven't defended yet, the list of expected members of your PhD committee (if known) and the expected date of defense (the defense, not the manuscript submission).
In addition, at least one recommendation letter from your PhD advisor should be sent directly by its author(s) to Yves.Laprie@loria.fr.
Applications are to be sent as soon as possible.
Required profile: PhD in computer science or acoustics. Knowledge about speech processing and speech production is a decisive plus. Working languages: French or English.