Title: Context-aware affective human behaviour analysis for far-range robot scene understanding
Starting date: October 1st 2021
Application deadline date: April 30th 2021
Decision announcement date: May 28th 2021
******** Keywords Video analysis, context-aware analysis, valence & arousal evaluation, expression analysis, action detection and classification, multimodal fusion, signal & image processing, deep learning.
This PhD thesis is funded by the ANR project µDialBot (Jan. 2021 - Dec. 2024) and is proposed by the Image team of the Hubert Curien laboratory (LabHC - https://laboratoirehubertcurien.univ-st-etienne.fr/en/teams/image-sciencecomputer-vision.html). The µDialBot consortium is composed of 5 partners (LabHC included): the Laboratoire d'Informatique d'Avignon (LIA), INRIA's Perception Team in Grenoble, the LUSAGE Living Lab of Hôpital Broca (AP-HP), and ERM Automatismes.

In µDialBot, our ambition is to actively incorporate human-behaviour cues into spoken human-robot communication. We intend to reach a new level in the exploitation of the rich information available in the audio and visual data flowing from humans when they interact with robots. In particular, extracting highly informative verbal and non-verbal perceptive features will enhance the robot's decision-making ability so that it can take speech turns more naturally and switch between multi-party/group interactions and face-to-face dialogues where required.

Recently there has been increasing interest in companion robots able to assist people in their everyday life and to communicate with them. These robots are perceived as social entities, and their utility for the healthcare and psychological well-being of the elderly has been acknowledged by several recent studies. Patients, their families and medical professionals appreciate the potential of such robots, provided that several technological barriers are overcome in the near future, most notably their ability to move, see and hear in order to communicate with people more naturally, well beyond touch screens and voice commands. The scientific and technological results of the project will be implemented on a commercially available social robot and will be tested and validated in several use cases in a day-care hospital unit.
More generally, one of the challenges of µDialBot consists in developing novel techniques for human behaviour understanding (HBU) using audio and visual data. The social-robot scenarios addressed in the project require both far-range (3 to 5 meters) and close-range (1 to 3 meters) HBU, as well as the development of learning methods for robot control. In particular, the robot should be able to learn how to select a group of people that requires its assistance and decide to navigate towards the selected group to eventually engage in face-to-face communication.

The goal of the PhD is to focus on context-aware affective behaviour analysis in the far-range setting. Typically, observing a group of persons in a waiting room from about 3 to 5 meters, the objective is to determine each person's emotional state in order to evaluate their readiness to communicate with the robot. In many works, the emotional state is determined from facial expressions at relatively close range. Some works consider images taken in the wild, but they also focus on the face [2, 4]. However, in real situations the face is not always visible, and many other elements of the scene can be used to evaluate the emotional state, such as gestures, the person's pose, or interactions with other persons or objects in the scene. This situation raises an emerging research question that can be referred to as context-aware emotion recognition. Most state-of-the-art methods on this topic mainly rely on low-level visual features without explicitly integrating the person's pose or interactions with the other persons in the scene. The originality of our approach will be to consider context at the whole-scene scale, integrating multiple persons, and to use the visual information provided by person pose, head gaze, gestures and inter-person interactions in a first study, adding audio information if needed.
A possible approach could be to use multi-stream neural networks to extract features from each source of information and to combine them in a classification or regression module. Another important question to address will be to evaluate the confidence of the inferred emotional state, knowing that not all sources are always available. LabHC has strong experience in expression analysis, gesture analysis and deep learning through previously defended PhD theses. We will also benefit from the Consortium's experience in multi-person 3D pose estimation, head-gaze estimation, etc.
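To make the multi-stream idea concrete, the sketch below shows one possible late-fusion scheme: each source of information (pose, gaze, gesture) is projected into a common embedding, missing streams are zeroed out, and the concatenation feeds a classifier whose output is paired with a crude confidence score. Everything here is illustrative: the stream names, feature dimensions, number of emotion classes and the random weights (standing in for trained parameters) are assumptions, not project specifications.

```python
import numpy as np

# Hypothetical per-stream feature dimensions (assumptions, not project specs).
STREAM_DIMS = {"pose": 34, "gaze": 6, "gesture": 128}
HIDDEN = 32      # size of each stream's projected embedding
N_CLASSES = 7    # e.g. a basic set of emotion categories

rng = np.random.default_rng(0)

# Random weights stand in for trained parameters.
proj = {name: rng.standard_normal((dim, HIDDEN)) * 0.1
        for name, dim in STREAM_DIMS.items()}
head = rng.standard_normal((HIDDEN * len(STREAM_DIMS), N_CLASSES)) * 0.1

def fuse_and_classify(features):
    """Late fusion: project each available stream, zero out missing ones,
    concatenate, and classify. `features` maps stream name -> vector or None."""
    parts, available = [], 0
    for name, dim in STREAM_DIMS.items():
        x = features.get(name)
        if x is None:
            parts.append(np.zeros(HIDDEN))              # missing stream
        else:
            parts.append(np.tanh(np.asarray(x) @ proj[name]))
            available += 1
    logits = np.concatenate(parts) @ head
    probs = np.exp(logits - logits.max())               # stable softmax
    probs /= probs.sum()
    # Crude confidence proxy: softmax peak scaled by stream availability.
    confidence = probs.max() * available / len(STREAM_DIMS)
    return probs, confidence

# Example: the gesture stream is unavailable (e.g. the person is occluded).
feats = {"pose": rng.standard_normal(34),
         "gaze": rng.standard_normal(6),
         "gesture": None}
probs, conf = fuse_and_classify(feats)
print(probs.shape, round(conf, 3))
```

In a real system the linear projections would be replaced by per-stream encoders (e.g. CNN or graph-based pose networks) trained end-to-end, and the availability-scaled confidence would be learned rather than hand-crafted; the sketch only illustrates how heterogeneous, possibly missing streams can share one decision module.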
We are looking for a motivated student holding an engineering diploma or a Master's degree (obtained before the 1st of October 2021) in the field of computer science, with strong skills in computer vision and/or machine learning. A good background in software development (algorithmics, Matlab/Octave/Scilab or Python, etc.) is expected.
******** Salary Net salary: around 1400 euros per month. Teaching activities are possible (64 hours per year).
******** Application process Your application should include the following documents:
******** Contacts (alphabetic order):
[1] Chen Chen, Zuxuan Wu, and Yu-Gang Jiang. Emotion in context: Deep semantic feature fusion for video emotion recognition. In Proceedings of the 24th ACM International Conference on Multimedia, pages 127–131, 2016.
[2] C. Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M. Martinez. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5562–5570, 2016.
[3] Jiyoung Lee, Seungryong Kim, Sunok Kim, Jungin Park, and Kwanghoon Sohn. Context-aware emotion recognition networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10143–10152, 2019.
[4] Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing, 28(5):2439–2450, 2018.
[5] Mohammad Mahdi Kazemi Moghaddam, Ehsan Abbasnejad, and Javen Shi. Follow the attention: Combining partial pose and object motion for fine-grained action detection. arXiv preprint arXiv:1905.04430, 2019.