Statistical learning with missing values
Scientific themes:
- A - Methods and models in signal processing
We remind you that, to guarantee access to the meeting rooms for all registered participants, registration for the meetings is free but mandatory.
50 members of GdR ISIS and 12 non-members are registered for this meeting.
Room capacity: 100 people.
7th December 2021
Zoom session 2pm-6pm
The term "missing data" is used when some variables of an observed vector are absent, which raises challenging problems in signal and image processing, e.g., in estimation theory, detection and classification. Handling missing data is a recurring difficulty in statistical signal processing, machine learning and data analysis. Applications where missing data have drawn significant attention include biomedical studies, chemometrics and remote sensing, where missing values caused by poor atmospheric conditions or sensor failure can dramatically hinder the understanding of the physical phenomenon under observation.
The proposed session will deal with recent advances and new challenges in the processing and analysis of signals or images affected by missing data. These approaches include likelihood-based approaches, Bayesian and multiple imputation approaches, inverse-probability weighting, pattern-mixture models, sensitivity analysis and approaches under nonignorable missingness, computational tools such as the Expectation-Maximization algorithm and the Gibbs sampler, etc.
Call for presentations
People wishing to present their work are invited to inform the organizers of their intention before Monday November 15, 2021, by sending an email to the organizers (a title, an abstract and the list of authors).
- Julie JOSSE, DR Inria
- Mathieu FAUVEL, CR CESBIO
- Mohammed Nabil EL KORSO (firstname.lastname@example.org)
- Yajing YAN (email@example.com)
- Jean Yves TOURNERET (firstname.lastname@example.org)
Abstracts of contributions
Supervised learning with missing values
J. Josse (INRIA)
An abundant literature addresses missing data in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We first study the seemingly simple case where the target to predict is a linear function of the fully observed data. For such a case, we introduce the NeuMiss network, which uses a new type of non-linearity: the multiplication by the mask. Then, we consider general regression models and study the validity of impute-then-regress procedures.
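As an illustration of the impute-then-regress procedure mentioned above, the following minimal sketch mean-imputes the missing entries, appends the missingness mask as extra features, and fits a linear model by least squares. It is not the NeuMiss network. The function names, the simulated data and the 20% completely-at-random missingness rate are assumptions made for the example.

```python
import numpy as np

def build_design(X, col_means):
    """Mean-impute NaNs, then append the 0/1 missingness mask and an intercept."""
    mask = np.isnan(X)
    X_imp = np.where(mask, col_means, X)  # col_means broadcasts over rows
    return np.hstack([X_imp, mask.astype(float), np.ones((X.shape[0], 1))])

def fit_impute_then_regress(X, y):
    """Hypothetical helper: learn imputation means and least-squares coefficients."""
    col_means = np.nanmean(X, axis=0)  # means of the observed entries only
    coef, *_ = np.linalg.lstsq(build_design(X, col_means), y, rcond=None)
    return col_means, coef

def predict(X, col_means, coef):
    """Apply the training-set imputation, then the linear model."""
    return build_design(X, col_means) @ coef

# Simulated data (assumption): linear target, 20% of entries missing at random.
rng = np.random.default_rng(0)
n, d = 500, 3
X_full = rng.normal(size=(n, d))
y = X_full @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
X = X_full.copy()
X[rng.random((n, d)) < 0.2] = np.nan

# Missing values appear in both the training and the testing data.
col_means, coef = fit_impute_then_regress(X[:400], y[:400])
y_pred = predict(X[400:], col_means, coef)

mse = float(np.mean((y_pred - y[400:]) ** 2))
baseline_mse = float(np.mean((y[400:] - y[:400].mean()) ** 2))
print(f"impute-then-regress MSE: {mse:.3f}, constant-prediction MSE: {baseline_mse:.3f}")
```

The imputation means are learned on the training rows only and reused at test time, so the test data never influence the fitted model.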
Joint Classification and Reconstruction of Irregularly Sampled Satellite Multivariate Image Times Series
Alexandre Constantin (INRIA), Mathieu Fauvel (CESBIO) and Stéphane Girard (INRIA)
Recent satellite missions have led to a huge amount of Earth observation data, most of which are freely available. In this context, satellite image time series (SITS) have been used to study land use and land cover (LULC) information. Supervised classification models are commonly used to extract LULC at large (country) scale. However, multivariate optical time series, such as those from Sentinel-2 or Landsat, are provided with an irregular time sampling for different spatial locations because of the orbital path. Furthermore, images may contain clouds and shadows at random spatial and temporal locations, resulting in missing values. Thus, preprocessing steps such as interpolation and smoothing are usually required to properly classify such data with conventional machine learning techniques.
In this talk, a multivariate Gaussian process mixture model is proposed to address the irregular sampling and the multivariate nature of the satellite time series. The proposed approach is able to deal with irregular temporal sampling and missing data directly in the classification process. The complexity of the method scales linearly with the number of pixels, making it suitable for large-scale scenarios. The multivariate Gaussian process mixture model allows for both the classification of time series and the imputation of missing values. Experimental results on simulated and real SITS data illustrate the importance of taking the spectral correlation into account to ensure good behavior in terms of classification accuracy and reconstruction error.
Federated Expectation Maximization with heterogeneity mitigation and variance reduction
A. DIEULEVEUT (CMAP, Ecole Polytechnique), G. FORT (IMT, CNRS), E. MOULINES (CMAP, Ecole Polytechnique), G. ROBIN (LAMME, CNRS)
The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. In this talk, we will introduce FedEM, which is the first extension of the EM algorithm to the federated learning context. FedEM is a new communication-efficient method, which handles partial participation of local devices and is robust to heterogeneous distributions of the datasets. To alleviate the communication bottleneck, FedEM compresses appropriately defined complete-data sufficient statistics. We also develop and analyze an extension of FedEM that further incorporates a variance reduction scheme. Numerical results will be presented to support our theoretical findings, as well as an application to federated missing values imputation for biodiversity monitoring. Finally, we will comment on the finite-time complexity bounds we obtained for these federated EM algorithms.
Tracking Dynamic Low-Rank Approximations of Higher-Order Incomplete Streaming Tensors
L.T. Thanh (University of Orléans), K. Abed Meraim (University of Orléans), N.L. Trung (VNU University of Engineering and Technology, Vietnam) and A. Hafiane (University of Orléans)
In recent years, the demand for adaptive (online) processing has been increasing because many applications generate a huge number of data streams over time. In parallel, missing data are increasingly common in modern datasets. In this talk, we propose two new provable algorithms for tracking online low-rank approximations of higher-order streaming tensors in the presence of missing data. The first algorithm, dubbed adaptive CP decomposition (ACP), minimizes an exponentially weighted recursive least-squares cost function to obtain the tensor factors efficiently, thanks to the alternating minimization framework and the randomized sketching technique. Under the Tucker model, the second algorithm, called adaptive Tucker decomposition (ATD), which is more flexible than the first one, first tracks the underlying low-dimensional subspaces covering the tensor factors, and then estimates the core tensor using a stochastic approximation. Both algorithms are fast and have low computational complexity and memory requirements. A unified convergence analysis is presented for ACP and ATD to justify their performance. Experiments indicate that the two proposed algorithms are capable of solving the adaptive tensor decomposition problem with competitive performance on both synthetic and real data.
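For readers unfamiliar with exponentially weighted least-squares costs on incomplete streams, a generic form is sketched below. The notation is an assumption made for illustration and is not taken from the talk: $\mathcal{X}_\tau$ is the tensor slice received at time $\tau$, $\mathcal{P}_\tau$ is its binary observation mask (1 for observed entries, 0 for missing ones), $\widehat{\mathcal{X}}_\tau$ is the low-rank reconstruction of that slice built from the factors $\mathbf{U}^{(n)}$, $\circledast$ is the entrywise product, and $\lambda$ is a forgetting factor.

```latex
\min_{\{\mathbf{U}^{(n)}\}} \; \sum_{\tau=1}^{t} \lambda^{t-\tau}
\left\| \mathcal{P}_{\tau} \circledast
\left( \mathcal{X}_{\tau} - \widehat{\mathcal{X}}_{\tau} \right) \right\|_F^2 ,
\qquad 0 < \lambda \le 1
```

With $\lambda < 1$, older slices are progressively discounted, which is what lets such a criterion follow time-varying factors. The mask ensures that only observed entries contribute to the fit.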
Robust Gaussian mixture model for data imputation and anomaly detection - Application to crop monitoring
F. Mouret (TerraNIS, University of Toulouse), M. Albughdadi (TerraNIS), S. Duthoit (TerraNIS), D. Kouamé (University of Toulouse), G. Rieu (TerraNIS) and J.-Y. Tourneret (University of Toulouse)
Remote sensing applications are generally affected by the presence of missing data. This is especially the case when the data are acquired using multispectral imaging satellites, which are sensitive to cloud coverage. To address this issue, we propose a robust imputation approach based on a Gaussian mixture model (GMM). The main originality of the proposed approach is to use outlier scores resulting from an outlier detection algorithm within the EM algorithm to 1) detect abnormal agricultural parcels and 2) obtain a robust estimation of the GMM parameters. Experimental results obtained on rapeseed and wheat crops using Sentinel-1 and Sentinel-2 data show that the GMM outperforms the other reconstruction strategies tested and leads to better detection of anomalous crop development. Moreover, the robust GMM approach is particularly useful if the dataset is corrupted by strong outliers, e.g., coming from a different crop type than the analyzed parcels.
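As background, the standard (non-robust) way to impute with a fitted GMM is the conditional expectation of the missing coordinates given the observed ones. The expression below is that textbook rule, not the robust variant of the talk, and it does not model the outlier scores. The notation is an assumption: a sample is split into observed and missing parts $(\mathbf{x}_o, \mathbf{x}_m)$, and $\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ are the weight, mean and covariance of component $k$, partitioned accordingly.

```latex
\gamma_k(\mathbf{x}_o) =
\frac{\pi_k \, \mathcal{N}\!\left(\mathbf{x}_o ; \boldsymbol{\mu}_{k,o}, \boldsymbol{\Sigma}_{k,oo}\right)}
     {\sum_{j=1}^{K} \pi_j \, \mathcal{N}\!\left(\mathbf{x}_o ; \boldsymbol{\mu}_{j,o}, \boldsymbol{\Sigma}_{j,oo}\right)},
\qquad
\hat{\mathbf{x}}_m = \sum_{k=1}^{K} \gamma_k(\mathbf{x}_o)
\left[ \boldsymbol{\mu}_{k,m} + \boldsymbol{\Sigma}_{k,mo} \boldsymbol{\Sigma}_{k,oo}^{-1}
\left( \mathbf{x}_o - \boldsymbol{\mu}_{k,o} \right) \right]
```

Each component contributes its own linear regression of the missing part on the observed part, weighted by the posterior probability $\gamma_k(\mathbf{x}_o)$ that the sample belongs to that component.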
Robust low-rank covariance matrix estimation with missing values and application to classification problems
A. Hippert-Ferrer (L2S, CentraleSupelec), M. N. El Korso (University Paris Nanterre), A. Breloy (University Paris Nanterre) and G. Ginolhac (Savoie Mont Blanc University)
Missing values are inherent to real-world data sets. Statistical learning problems often require the estimation of parameters such as the mean or the covariance matrix (CM). If the data are incomplete, new estimation methodologies need to be designed depending on the data distribution and the missingness pattern (i.e., the pattern describing which values are missing with respect to the observed data). This talk considers robust CM estimation when the data are incomplete. In this setting, classical statistical estimation methodologies are usually built upon the Gaussian assumption, whereas existing robust estimation methods assume unstructured signal models. The former can be inaccurate on real-world data sets in which heterogeneity causes heavy-tailed distributions, while the latter do not exploit the usual low-rank structure of the signal. Taking advantage of both worlds, a CM estimation procedure is designed on a robust (compound Gaussian) low-rank model by leveraging the observed-data likelihood function within an expectation-maximization (EM) algorithm. After validation on simulated data sets with various missingness patterns, the interest of the proposed procedure is shown for CM-based classification and clustering problems with incomplete data. The investigated examples generally show higher classification accuracy with a classifier based on robust estimation than with one based on the Gaussian assumption or one based on imputed data.
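To make the role of EM concrete, here is a minimal sketch of the classical Gaussian baseline that the abstract contrasts with: EM estimation of a mean and an unstructured covariance matrix from data with missing entries. It is not the robust compound-Gaussian low-rank procedure of the talk. The function name, the simulated data and the 30% completely-at-random missingness rate are assumptions made for the example.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=30):
    """Hypothetical helper: EM for a Gaussian mean/covariance, NaN = missing."""
    n, d = X.shape
    obs = ~np.isnan(X)
    mu = np.nanmean(X, axis=0)             # initialise from observed entries
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(d)
        sum_xx = np.zeros((d, d))
        for i in range(n):
            o, m = obs[i], ~obs[i]
            x_hat = X[i].copy()
            cond_cov = np.zeros((d, d))    # conditional covariance of missing part
            if m.any() and o.any():
                # E-step: regress the missing coordinates on the observed ones.
                gain = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
                x_hat[m] = mu[m] + gain @ (X[i, o] - mu[o])
                cond_cov[np.ix_(m, m)] = sigma[np.ix_(m, m)] - gain @ sigma[np.ix_(o, m)]
            elif m.any():
                # Fully missing row: fall back on the current marginal.
                x_hat = mu.copy()
                cond_cov = sigma.copy()
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + cond_cov
        # M-step: update the parameters from the expected sufficient statistics.
        mu = sum_x / n
        sigma = sum_xx / n - np.outer(mu, mu)
    return mu, sigma

# Simulated data (assumption): 3-dimensional Gaussian, 30% of entries removed.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0, 0.5])
true_sigma = np.array([[1.0, 0.8, 0.3],
                       [0.8, 1.5, 0.5],
                       [0.3, 0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=1000)
X[rng.random(X.shape) < 0.3] = np.nan

mu_hat, sigma_hat = em_gaussian_missing(X)
print("estimated mean:", np.round(mu_hat, 2))
print("estimated covariance:\n", np.round(sigma_hat, 2))
```

The conditional covariance term added in the E-step is what distinguishes EM from simply imputing conditional means and computing a sample covariance, which would underestimate the variances.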