


24 March 2021

Continuous Sign Language recognition for the design of a gestural server


Category: PhD position


 

1. Practical information

2. Context of the work

3. Objectives

3.1 Data

3.2 Video pre-processing: spatio-temporal representations of the signer

3.3 Sign language recognition

4. References

Keywords: computer vision, machine learning, sign languages, gesture and movement

Start date: 1 September 2021


1. Practical information
Location of the work. The PhD candidate will be a member of two laboratories, Gipsa-Lab and LIMSI CNRS (since January 1st, 2021, LIMSI has become LISN, the result of the merger of LRI and LIMSI). The location of the PhD work will be decided together with the candidate:
— Gipsa-Lab (Grenoble), 11 rue des Mathématiques, Grenoble Campus BP46, F-38402 Saint Martin d'Hères cedex. Team CRISSP: audiovisual speech processing and machine learning, with special expertise in the recognition and generation of the gesture-based communication system "Cued Speech" for hearing-impaired people.
— LIMSI (Orsay), https://www.limsi.fr, Campus universitaire bât 507, Rue du Belvédère, 91405 Orsay cedex. Teams AMI and ILES: image and video analysis, computer vision, machine learning, sign languages (corpora, models, recognition, synthesis).

Contacts: denis.beautemps@gipsa-lab.grenoble-inp.fr, thomas.hueber@gipsa-lab.grenoble-inp.fr, michele.gouiffes@limsi.fr, annelies.braffort@limsi.fr

Salary: 2135 euros per month (gross salary). A complementary teaching assistant position is also possible.

Required skills. The ideal candidate has a solid background in mathematics and computer science (Master's or Engineering degree) with a specialization in computer vision, natural language processing or machine learning. The candidate will develop Python code for data analysis and learning (TensorFlow or PyTorch frameworks). Good written English is required for the scientific articles and the thesis manuscript, and good spoken English is also required. Fluency in French would be appreciated but is not required.

2. Context of the work
This PhD topic is proposed within the framework of the Serveur Gestuel project, funded by Bpifrance (public-private partnership). The objective of the project is to provide Deaf people practicing sign language with the equivalent of a voice server for hearing people. The selected candidate will carry out his/her thesis work in a multidisciplinary environment mixing academic research teams from LISN (formerly LIMSI) in Orsay and GIPSA-lab in Grenoble with two industrial partners from the consortium, with whom the PhD candidate will interact on a regular basis. As a result:
— he/she will benefit from strong skills in vision, natural language processing and machine learning;
— he/she will benefit from solid technical support to develop prototypes and carry out user tests in real conditions;
— he/she will have the opportunity to deepen his/her knowledge of the social, economic and technical environment of the Deaf community;
— he/she will contribute to the heart of the project: the development of a system able to automatically decode sign language into written text.
Sign languages (SL) are natural languages used by Deaf communities. Unlike vocal languages, which are audio-phonatory, SL are visuo-gestural. They are also multimodal, in the sense that information is conveyed by different articulators (hands, arms, chest, shoulders, head, facial features, gaze) and their movements. SL utterances include several types of gestural units: lexical signs, which are conventional signs whose form and meaning remain stable regardless of context and which can be listed in a dictionary; complex structures built on the fly for illustrative purposes, which therefore cannot be listed in a dictionary; and a large number of movements with a linguistic role carried by articulators other than the hands. In addition, the discourse is structured in space, which is used to contextualize signs, to place objects or concepts, and to create visual relationships between these entities. Thus, an SL utterance cannot be reduced to a simple sequence of signs with an equivalent in the vocal language. It should also be noted that the shape of the signs varies according to the transitions that precede and follow them (co-articulation) and to linguistic constraints (spatialization).
The automatic recognition of sign language is attracting growing interest in the computer vision and machine learning communities. However, most studies focus on non-spatial lexical signs or ignore co-articulation. Addressing these complex aspects of sign language is one of the main goals of this PhD project.


3. Objectives
The objective of the thesis is to investigate deep learning methods for natural sign language recognition that better account for co-articulation, spatialization and illustrative structures. This work will initially be based on the data, tools and knowledge developed at LIMSI [1,2,4,5]. The research work will focus on three axes: 1) building realistic sign language datasets (in collaboration with industrial partners), 2) image processing, feature extraction and representation learning, 3) a deep learning pipeline for decoding sign language.


3.1 Data
Unlike action recognition, sign language recognition does not benefit from large datasets. First of all, there is no universal sign language, but different languages (French, American, German, Chinese, etc.). For each of them, corpora are built using various methodologies. The annotation of these videos is not always provided and, when available, the annotation rules vary according to the linguistic model used. Most public datasets propose either a set of isolated lexical signs in a given language, very simple but unrealistic utterances (a subject, an action and an object), or domain-specific utterances (for example, weather forecasts). In this thesis, we will rely on so-called 'natural' sign language corpora, in the sense that few constraints are imposed on the content and form of the utterances. Unlike most SL datasets, these corpora contain both lexical signs and illustrative structures. In addition, the thesis will help expand the corpora by leveraging public data and collecting data through the consortium, and will assess the impact of using synthetic data.


3.2 Video pre-processing: spatio-temporal representations of the signer


The raw videos will be transformed into compact spatio-temporal data representing the signer and his/her articulators. These data will serve as training data for deep learning models. Using a signer representation made up of key points for the body and face pose, produced by deep learning (the OpenPose library [6]), together with a set of dynamic features computed from these key points, we were able to develop a first system recognizing a few of the most common signs [7] (see Fig. 1). Hand analysis remains an issue due to the large number of degrees of freedom and the motion blur caused by the high speed of hand movements. The most recent tools identify handshapes based on hand appearance [6], but they notably miss hand orientation. The thesis will be an opportunity to exploit the latest advances in finger detection in a video stream.
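To make the dynamic features mentioned above concrete, here is a minimal sketch assuming the key points have already been extracted (e.g., with OpenPose) into an array of shape (T, K, 2); the choice of finite-difference velocity and acceleration, the frame rate and all dimensions are illustrative assumptions of ours, not specifications from the project.

# Sketch: turning per-frame 2D key points into compact spatio-temporal
# features by stacking positions with finite-difference velocities and
# accelerations (illustrative choice of dynamic features).
import numpy as np

def dynamic_features(keypoints: np.ndarray, fps: float = 25.0) -> np.ndarray:
    """keypoints: (T, K, 2) array of T frames, K key points, (x, y) coords.
    Returns a (T, K * 6) feature matrix."""
    dt = 1.0 / fps
    velocity = np.gradient(keypoints, dt, axis=0)       # first derivative
    acceleration = np.gradient(velocity, dt, axis=0)    # second derivative
    feats = np.concatenate([keypoints, velocity, acceleration], axis=-1)
    return feats.reshape(feats.shape[0], -1)

# Random data standing in for a 2-second clip (50 frames, 25 body key points):
clip = np.random.rand(50, 25, 2).astype(np.float32)
print(dynamic_features(clip).shape)  # (50, 150)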

3.3 Sign language recognition
The core of the thesis will consist in designing machine learning pipelines addressing the following objectives:
— Automatic sign spotting: this approach will allow querying a sign language database directly with a short SL video, without any textual input; this detection will automatically annotate videos to enrich the training databases (a first sketch of such a pipeline follows this list).
— Automatic conversion of sign language sequences into written text: a special focus will be put on sequence-to-sequence models [8] (a second sketch follows this list). The problem of adapting a pre-trained model to a new signer or a new recording set-up (e.g., video device, recording environment) will also be addressed.
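As an illustration of the sign-spotting objective, here is a minimal sketch under assumptions of ours rather than the project's: a query clip and sliding windows of a longer video are mapped to a shared embedding space, and windows are ranked by cosine similarity. ClipEncoder, the window size and the threshold are all hypothetical stand-ins.

# Sketch of embedding-based sign spotting (one plausible approach, not the
# project's prescribed method). A query clip is compared to sliding windows
# of a longer video in a shared embedding space.
import torch
import torch.nn as nn

class ClipEncoder(nn.Module):
    """Toy encoder (hypothetical): projects per-frame features and mean-pools
    them over time into a single clip embedding."""
    def __init__(self, feat_dim: int, emb_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, feat_dim) -> (emb_dim,)
        return self.proj(feats).mean(dim=0)

def spot_sign(query: torch.Tensor, video: torch.Tensor, encoder: nn.Module,
              window: int = 16, stride: int = 4, threshold: float = 0.8):
    """Return (start_frame, score) for windows whose embedding is close to
    the embedding of the query clip."""
    query_emb = encoder(query)
    hits = []
    for start in range(0, video.shape[0] - window + 1, stride):
        win_emb = encoder(video[start:start + window])
        score = torch.cosine_similarity(query_emb, win_emb, dim=0).item()
        if score >= threshold:
            hits.append((start, score))
    return hits

# Usage with random data standing in for 150-D per-frame keypoint features.
# NB: with this untrained toy encoder the scores are meaningless; a trained
# encoder is assumed in practice.
encoder = ClipEncoder(feat_dim=150)
query = torch.randn(16, 150)    # a short query clip
video = torch.randn(200, 150)   # a longer video to search
print(spot_sign(query, video, encoder))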
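For the conversion objective, here is a minimal sequence-to-sequence sketch in the spirit of the models surveyed in [8]; the GRU encoder-decoder, all dimensions and the vocabulary size are illustrative assumptions, not the project's design.

# Minimal sketch of a sequence-to-sequence model mapping a keypoint-feature
# sequence to text tokens (illustrative GRU encoder-decoder).
import torch
import torch.nn as nn

class Sign2Text(nn.Module):
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) video features; tokens: (B, L) shifted targets.
        _, h = self.encoder(feats)            # h: (1, B, hidden) summarizes the video
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)              # (B, L, vocab_size) logits

model = Sign2Text(feat_dim=150, vocab_size=1000)
feats = torch.randn(2, 50, 150)               # two clips of 50 frames each
tokens = torch.randint(0, 1000, (2, 12))      # target token prefixes
print(model(feats, tokens).shape)             # torch.Size([2, 12, 1000])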

4. References
[1] V. Belissen, A. Braffort, M. Gouiffès. Experimenting the Automatic Recognition of Non-Conventionalized Units in Sign Language. Algorithms 2020, 13, p. 310.
[2] H. Chaaban, M. Gouiffès, A. Braffort. Towards an Automatic Annotation of French Sign Language Videos: Detection of Lexical Signs. CAIP 2019. 10.1007/978-3-030-29891-3-35.
[3] S. Matthes, T. Hanke, A. Regen, J. Storz, S. Worseck, E. Efthimiou, N. Dimou, A. Braffort, J. Glauert, E. Safar. Dicta-Sign – Building a Multilingual Sign Language Corpus. 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, Istanbul, Turkey, ELRA, 2012. www.ortolang.fr, https://hdl.handle.net/11403/dicta-sign-lsf-v2/v1.
[4] V. Belissen, A. Braffort, M. Gouiffès. Dicta-Sign-LSF-v2: Remake of a Continuous French Sign Language Dialogue Corpus and a First Baseline for Automatic Sign Language Processing. LREC 2020.
[5] H. Bull, A. Braffort, M. Gouiffès. MEDIAPI-SKEL - A 2D-Skeleton Video Database of French Sign Language With Aligned French Subtitles. LREC 2020.
[6] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CVPR 2017, Honolulu, Hawaii.
[7] O. Koller, H. Ney, R. Bowden. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. CVPR 2016.
[8] O. Koller. Quantitative Survey of the State of the Art in Sign Language Recognition. arXiv:2008.09918, https://arxiv.org/abs/2008.09918, 2020.

 
