
PhD thesis defense of Nicolas Ballas

7 November 2013

Category: Thesis defense

PhD thesis defense of Nicolas Ballas:

Modélisation de contextes pour l’annotation sémantique de vidéos (Context modeling for semantic video annotation)

The defense, in English, will take place on Tuesday, 12 November at 2 p.m. at Mines de Paris.

Mines ParisTech
60, boulevard Saint Michel


Françoise Prêteux and Bertrand Delezoide are pleased to invite you to the PhD thesis defense of Nicolas Ballas, entitled:
Modélisation de contextes pour l’annotation sémantique de vidéos (Context modeling for semantic video annotation)

The defense will be followed by the traditional thesis reception, to which you are also invited.

Contact:

Committee:

  • Ms. Cordelia Schmid, Research Director, INRIA, Chair
  • Mr. Jean Ponce, Research Director, ENS, Reviewer
  • Mr. Georges Quénot, Research Director, CNRS, Reviewer
  • Mr. Alexander Hauptmann, Senior System Scientist, CMU, Examiner
  • Mr. Josef Sivic, Researcher, INRIA, Examiner
  • Mr. Marcin Detyniecki, Researcher, CNRS, Examiner
  • Ms. Françoise Prêteux, Deputy Director, Mines-ParisTech, Advisor
  • Mr. Bertrand Delezoide, Research Engineer, CEA-LIST, Co-advisor

Abstract:

Recent years have witnessed an explosion in the amount of multimedia content available. In 2010, the video-sharing website YouTube announced that 35 hours of video were uploaded to its site every minute, whereas in 2008 users were “only” uploading 12 hours of video per minute. Given this growth in data volume, human analysis of each video is no longer feasible; automated video analysis systems are needed.

This thesis proposes a solution for automatically annotating video content with a textual description. The core novelty of the thesis is its use of multiple sources of contextual information to perform the annotation. With the constant expansion of online visual collections, automatic video annotation has become a major problem in computer vision. It consists in detecting various objects (human, car...), dynamic actions (running, driving...) and scene characteristics (indoor, outdoor...) in unconstrained videos. Progress in this domain would impact a wide range of applications, including video search, intelligent video surveillance and human-computer interaction.

Although some improvements have been shown in concept annotation, it remains an unsolved problem, notably because of the semantic gap. The semantic gap is defined as the lack of correspondence between video features and high-level human understanding. This gap is principally due to the intra-class variability of concepts, caused by photometric changes, object deformation, object motion, camera motion or viewpoint changes. To tackle the semantic gap, we enrich the description of a video with multiple sources of contextual information. Context is defined as “the set of circumstances in which an event occurs”. Video appearance, motion or space-time distribution can be considered as contextual clues associated with a concept. We argue that a single context is not informative enough to discriminate a concept in a video. However, by considering several contexts at the same time, we can address the semantic gap.

More precisely, the major contributions of the thesis are the following:

  • A novel framework that takes several sources of contextual information into consideration: To benefit from multiple contextual clues, we introduce a fusion scheme based on a generalized sparsity criterion. This fusion model automatically infers the set of relevant contexts for a given concept.
  • A context model of feature inter-dependencies: Different features capture complementary information. For instance, the Histogram of Oriented Gradients (HoG) focuses on video appearance, while the Histogram of Optical Flow (HoF) collects motion information. Most existing works capture the statistics of different features independently. By contrast, we leverage their covariance to refine our video signature.
  • A concept-dependent model of space-time context: Discriminative information is not equally distributed in the video space-time domain. To identify the discriminative regions, we introduce a learning algorithm that determines the space-time shape associated with each individual concept.
  • An attention context model: We enrich video signatures with biologically inspired attention maps. Such maps make it possible to capture space-time contextual information while preserving the invariance of the video signature to translation, rotation and scaling. Without this space-time invariance, different instances of a concept located at various positions in the space-time volume can result in divergent representations. This problem is severe for dynamic actions, which exhibit dramatic space-time variability.
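
To make the feature inter-dependence idea more concrete, here is a minimal sketch in Python of how a covariance between appearance and motion descriptors could be computed. It is only an illustration, not the method of the thesis: the function name `covariance_signature`, the input layout (one HoG-like and one HoF-like vector per local region) and the use of the unbiased sample covariance are all assumptions made for this example.

```python
def covariance_signature(hog, hof):
    """Sample covariance matrix of concatenated per-region descriptors.

    hog, hof: lists of equal length; element i holds the appearance and
    motion descriptor (a list of floats) of local region i. Both the
    name and the input layout are hypothetical, chosen for illustration.
    """
    if len(hog) != len(hof) or len(hog) < 2:
        raise ValueError("need at least two aligned descriptor pairs")

    # Concatenate the appearance and motion descriptors of each region.
    rows = [list(a) + list(m) for a, m in zip(hog, hof)]
    n, d = len(rows), len(rows[0])
    if any(len(r) != d for r in rows):
        raise ValueError("all regions must have the same dimensions")

    # Center every dimension on its mean over the regions.
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]

    # Unbiased sample covariance (division by n - 1).
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


if __name__ == "__main__":
    # Toy data: three regions, one appearance and one motion value each.
    appearance = [[1.0], [2.0], [3.0]]
    motion = [[2.0], [4.0], [6.0]]
    print(covariance_signature(appearance, motion))
```

In the resulting matrix, the diagonal blocks describe each feature type on its own, while the off-diagonal entries describe how appearance and motion vary together, which is the kind of information that independent per-feature statistics discard.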