In recent years, scene parsing has received increasing attention from the research community, culminating in international challenges that aim to drive research in a more systematic way, such as the Scene Parsing Challenge hosted by MIT (http://sceneparsing.csail.mit.edu/results2016.html). Traditionally, scene parsing is seen as a labeling task: each pixel is attributed the class of the object it belongs to. Recent state-of-the-art approaches use Fully Convolutional Networks (FCN) [1] with Conditional Random Fields [2] for probabilistic inference and deconvolution layers for upsampling. On the other hand, object detection and recognition based on convolutional networks and transfer learning have now matured into reliable tools successfully applied to a wide range of practical scenarios [3, 4].
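To make the labeling formulation concrete, here is a minimal sketch in NumPy: random per-class score maps stand in for the output of a segmentation network (e.g. an FCN after upsampling), and each pixel is labeled with its highest-scoring class. The shapes and variable names are illustrative assumptions, not part of any particular system.

```python
import numpy as np

# Hypothetical per-class score maps, as produced by a segmentation
# network after upsampling: shape (num_classes, height, width).
num_classes, height, width = 3, 4, 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((num_classes, height, width))

# Scene parsing as a labeling task: each pixel is attributed the
# class with the highest score at that location.
labels = scores.argmax(axis=0)

print(labels.shape)  # one class index per pixel: (4, 5)
```

In a real pipeline the scores would come from the network's final (softmax) layer, and a CRF would refine this per-pixel decision using neighborhood consistency.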
In this PhD thesis we propose to go beyond this framework and approach scene understanding from the structural point of view: parse images into structured semantic object hierarchies, including object interaction relations, in a way that agrees with human interpretation. This is motivated by the needs of a variety of decision-making scenarios where spatial or structural understanding of the scene is necessary (e.g. autonomous agents, surveillance, information aggregation, etc.). This requires combining information from detection, segmentation and recognition tasks so as to produce a globally consistent representation. One of the challenges is that the information necessary for coherent scene interpretation may come from distant sites in the image. Thus, context is essential for the correct identification of elementary components, as is the ability to perform multi-scale information transfer.
We intend to investigate at least two directions:
Top-down and/or bottom-up analysis and reconstruction of visual scenes using (hierarchical) sequences of object detectors. The aim is to use transfer learning to support "object A is a part of object B" types of description, but also to incorporate local geometric analysis so as to produce a combined geometric and topological description of the image content.
Neural network architectures that emphasize the more general idea of "cumulative evidence" for object detection and recognition: while component subnetworks should be triggered/activated by the presence of local patterns in the input, the sequence of decisions should be based on the idea that an object is a non-local entity with geometric constraints and structural consistency (for example, the recent CapsNets architecture models an object as a bag of vector features and poses).
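As an illustration of the "bag of vector features and poses" idea, the CapsNets architecture replaces scalar activations with vectors whose length encodes the probability that an entity is present and whose orientation encodes its pose. Its "squash" nonlinearity, sketched below, shrinks short vectors toward zero and long vectors toward unit length while preserving orientation; the small epsilon is our own numerical-safety addition.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Capsule 'squash' nonlinearity: v = (|s|^2 / (1 + |s|^2)) * s / |s|.
    Output length lies in [0, 1) and can be read as a presence probability;
    the direction of s (its 'pose') is preserved."""
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

v = squash(np.array([3.0, 4.0]))   # input vector of length 5
print(np.linalg.norm(v))           # length squashed to ~0.96, below 1
```

A long input vector (strong evidence) thus yields an output of length close to 1, while weak evidence yields a near-zero vector, in line with the "cumulative evidence" viewpoint above.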
This PhD subject is expected to be financed within the CIFRE framework (http://www.anrt.asso.fr/fr/cifre-7843); the partners involved are the company XXII and the CEDRIC-Cnam lab.
Founded in 2015, XXII is a start-up specializing in deep technologies, with two main activities:
XXII DIGITAL SOLUTIONS – Innovative, interactive and immersive content.
XXII AI – An intelligence provider through a marketplace of artificial intelligence solutions for security, retail, smart cities and industrial purposes.
With 40+ talented and passionate PhDs, engineers, developers and creatives working in France, China and the US, XXII’s mission is to build tomorrow’s tools to augment Humans and their senses.
As an innovative AI startup, XXII is collaborating with the CEDRIC-Cnam in this PhD to create a fully automated image analysis and scene understanding process.
The Centre d’Etudes et de Recherche en Informatique et Communications (CEDRIC) is an academic laboratory that is part of the Cnam (Conservatoire National des Arts et Métiers, http://www.cnam.fr) and is located in the middle of Paris. The domains of the 160 researchers belonging to 7 teams cover broad areas of computer science, signal processing and statistics. Several permanent members and PhD students in two teams of the lab are working on deep learning and image processing.
The successful candidate should have good knowledge of machine learning, and more specifically of deep learning, as well as general knowledge of image processing methods. She/he should be familiar with at least one deep learning framework.
[1] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation", IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, pp. 640–651, Apr. 2017.
[2] S. Zheng et al., "Conditional random fields as recurrent neural networks", ICCV 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", CVPR 2016.
[4] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks", CVPR 2017.