# Meeting

## Statistical learning with missing values

**Date:** 7-12-2021

**Venue:** Teams

**Scientific themes:**

- A - Methods and models in signal processing

Please note that, to ensure every registered participant can access the meeting rooms, **registration for meetings is free but mandatory**.

### Registration

89 members of GdR ISIS and 34 non-members are registered for this meeting.

Room capacity: 200 people.

### Announcement

**Statistical learning with missing values**

7th December 2021 - Teams session 2pm-6pm

The term "missing data" is used when some variables of an observed vector are absent, which raises challenging problems in signal and image processing, e.g., in estimation theory, detection and classification. Handling missing data is a recurring difficulty in statistical signal processing, machine learning and data analysis. Applications where missing data has drawn significant attention include biomedical studies, chemometrics and remote sensing, where missing values created by poor atmospheric conditions or sensor failure can dramatically hinder the understanding of the physical phenomenon under observation.

The proposed session will deal with recent advances and new challenges in the processing and analysis of signals or images affected by missing data. These approaches include likelihood-based approaches, Bayesian and multiple imputation approaches, inverse-probability weighting, pattern-mixture models, sensitivity analysis and approaches under nonignorable missingness, computational tools such as the Expectation-Maximization algorithm and the Gibbs sampler, *etc.*

### Call for presentations

People wishing to present their work are invited to inform the organizers of their intention before Monday November 15, 2021, by sending an email to the organizers (a title, an abstract and the list of authors).

### Organizers

- Mohammed Nabil EL KORSO (m.elkorso@parisnanterre.fr)
- Yajing YAN (yajing.yan@univ-smb.fr)
- Jean Yves TOURNERET (jean-yves.tourneret@toulouse-inp.fr)

### Programme

**14h00-14h05:** Introduction

**14h05-14h40:** What's a good imputation to predict with missing values? Marine Le Morvan (INRA), Julie Josse (INRA), Erwan Scornet (CMAP, Ecole Polytechnique), Gael Varoquaux (INRA)

**14h40-15h15:** Federated Expectation Maximization with heterogeneity mitigation and variance reduction, A. DIEULEVEUT (CMAP, Ecole Polytechnique), G. FORT (IMT, CNRS), E. MOULINES (CMAP, Ecole Polytechnique), G. ROBIN (LAMME, CNRS)

**15h15-16h00:** Joint Classification and Reconstruction of Irregularly Sampled Satellite Multivariate Image Times Series, Alexandre Constantin (INRIA), Mathieu Fauvel (CESBIO) and Stéphane Girard (INRIA)

**16h00-16h15:** Break

**16h15-16h35:** Robust Gaussian mixture model for data imputation and anomaly detection - Application to crop monitoring, F. Mouret (TerraNIS, University of Toulouse), M. Albughdadi (TerraNIS), S. Duthoit (TerraNIS), D. Kouamé (University of Toulouse), G. Rieu (TerraNIS) and J.-Y. Tourneret (University of Toulouse)

**16h35-16h55:** Tracking Dynamic Low-Rank Approximations of Higher-Order Incomplete Streaming Tensors, L.T. Thanh (University of Orléans), K. Abed Meraim (University of Orléans), N.L. Trung (VNU University of Engineering and Technology, Vietnam) and A. Hafiane (University of Orléans)

**16h55-17h15:** Robust low-rank covariance matrix estimation with missing values and application to classification problems, A. Hippert-Ferrer (L2S, CentraleSupelec), M. N. El Korso (University Paris Nanterre), A. Breloy (University Paris Nanterre) and G. Ginolhac (Savoie Mont Blanc University)

**17h15-17h35:** Hybrid generalized approximate message passing algorithm for the generalized linear model with correlated priors, Lélio Chetot (Inria), Malcolm Egan (Inria) and Jean-Marie Gorce (Inria)

**17h35-17h55:** Fast informed nonnegative matrix factorization for mobile sensor calibration, Farouk Yahaya (Univ. littoral, LISIC), Matthieu Puigt (Univ. littoral, LISIC), Olivier Vu thanh (Univ. littoral, LISIC), Gilles Delmaire (Univ. littoral, LISIC) and Gilles Roussel (Univ. littoral, LISIC).

**17h55-18h:** Conclusion

### Abstracts

### What's a good imputation to predict with missing values?

**Marine Le Morvan (INRA), Julie Josse (INRA), Erwan Scornet (CMAP, Ecole Polytechnique), Gael Varoquaux (INRA)**

How can one learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and then learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Our experiments confirm that joint imputation and regression through NeuMiss is better than various two-step procedures in a finite-sample regime.
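As a rough illustration of the two-step "impute-then-regress" procedure discussed in this abstract, the sketch below uses mean imputation followed by ordinary least squares. It is not NeuMiss and not the authors' code: the linear learner is a deliberately simple stand-in for the "powerful learner", and the function names, the appended missingness indicators, the synthetic data and the 20% missing-completely-at-random rate are all assumptions made for the example.

```python
import numpy as np

def build_design(X, col_means):
    """Mean-impute X, then append missingness indicators and an intercept."""
    mask = np.isnan(X)
    X_imp = np.where(mask, col_means, X)      # constant (mean) imputation
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X_imp, mask.astype(float), ones])

def impute_then_regress(X, y):
    """Step 1: impute with observed column means. Step 2: least squares."""
    col_means = np.nanmean(X, axis=0)         # means over observed entries only
    design = build_design(X, col_means)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return col_means, coef

def predict(X, col_means, coef):
    return build_design(X, col_means) @ coef

# Hypothetical synthetic data: linear outcome, 20% of entries removed at random.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 3))
y = X_full @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
X = X_full.copy()
X[rng.random(X.shape) < 0.2] = np.nan

col_means, coef = impute_then_regress(X, y)
y_hat = predict(X, col_means, coef)
mse = float(np.mean((y - y_hat) ** 2))        # training error of the two-step fit
baseline = float(np.var(y))                   # error of predicting the mean of y
print(f"impute-then-regress MSE: {mse:.3f}  (mean predictor: {baseline:.3f})")
```

With a linear learner the fit remains limited on incomplete rows; the abstract's point is that a sufficiently flexible learner, or joint learning of imputation and regression, is needed to close that gap.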

### Joint Classification and Reconstruction of Irregularly Sampled Satellite Multivariate Image Times Series

**Alexandre Constantin (INRIA), Mathieu Fauvel (CESBIO) and Stéphane Girard (INRIA)**

Recent satellite missions have led to a huge amount of Earth observation data, most of which is freely available. In this context, satellite image time series (SITS) have been used to study land use and land cover (LULC) information. Supervised classification models are commonly used to extract LULC at large (country) scale. However, multivariate optical time series, such as Sentinel-2 or Landsat ones, are provided with an irregular time sampling for different spatial locations because of the orbital path. Furthermore, images may contain clouds and shadows at random spatial and temporal locations, resulting in missing values. Thus, preprocessing steps such as interpolation and smoothing techniques are usually required to properly classify such data with conventional machine learning techniques.

In this talk, a multivariate Gaussian process mixture model is proposed to address the irregular sampling and the multivariate nature of the satellite time series. The proposed approach is able to deal with irregular temporal sampling and missing data directly in the classification process. The method complexity scales linearly with the number of pixels, making it suitable for large-scale scenarios. The multivariate Gaussian process mixture model allows both for the classification of time series and the imputation of missing values. Experimental results on simulated and real SITS data illustrate the importance of taking into account the spectral correlation to ensure a good behavior in terms of classification accuracy and reconstruction errors.

### Federated Expectation Maximization with heterogeneity mitigation and variance reduction

**A. DIEULEVEUT (CMAP, Ecole Polytechnique), G. FORT (IMT, CNRS), E. MOULINES (CMAP, Ecole Polytechnique), G. ROBIN (LAMME, CNRS)**

The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. In this talk, we will introduce FedEM, which is the first extension of the EM algorithm to the federated learning context. FedEM is a new communication-efficient method, which handles partial participation of local devices, and is robust to heterogeneous distributions of the datasets. To alleviate the communication bottleneck, FedEM compresses appropriately defined complete-data sufficient statistics. We also develop and analyze an extension of FedEM to further incorporate a variance reduction scheme. Numerical results will be presented to support our theoretical findings, as well as an application to federated missing values imputation for biodiversity monitoring. Finally, we will comment on the finite-time complexity bounds we obtained for these federated EM algorithms.

### Tracking Dynamic Low-Rank Approximations of Higher-Order Incomplete Streaming Tensors

**L.T. Thanh (University of Orléans), K. Abed Meraim (University of Orléans), N.L. Trung (VNU University of Engineering and Technology, Vietnam) and A. Hafiane (University of Orléans)**

In recent years, the demand for adaptive (online) processing has been increasing because many applications generate a huge number of data streams over time. In parallel, missing data are ubiquitous and increasingly common in modern datasets. In this talk, we propose two new provable algorithms for tracking online low-rank approximations of higher-order streaming tensors in the presence of missing data. The first algorithm, dubbed adaptive CP decomposition (ACP), minimizes an exponentially weighted recursive least-squares cost function to obtain the tensor factors in an efficient way, thanks to the alternating minimization framework and the randomized sketching technique. Under the Tucker model, the second algorithm, called adaptive Tucker decomposition (ATD), which is more flexible than the first one, first tracks the underlying low-dimensional subspaces covering the tensor factors, and then estimates the core tensor using a stochastic approximation. Both algorithms are fast and have low computational complexity and memory requirements. A unified convergence analysis is presented for ACP and ATD to justify their performance. Experiments indicate that the two proposed algorithms can solve the adaptive tensor decomposition problem with competitive performance on both synthetic and real data.

### Robust Gaussian mixture model for data imputation and anomaly detection - Application to crop monitoring

**F. Mouret (TerraNIS, University of Toulouse), M. Albughdadi (TerraNIS), S. Duthoit (TerraNIS), D. Kouamé (University of Toulouse), G. Rieu (TerraNIS) and J.-Y. Tourneret (University of Toulouse)**

Remote sensing applications are generally affected by the presence of missing data. This is especially the case when the data are acquired using multi-spectral imagery satellites, which are sensitive to cloud coverage. To address this issue, we propose a robust imputation approach based on a Gaussian mixture model (GMM). The main originality of the proposed approach is to use outlier scores resulting from an outlier detection algorithm within the EM algorithm to 1) detect abnormal agricultural parcels and 2) obtain a robust estimation of the GMM parameters. Experimental results conducted on rapeseed and wheat crops using Sentinel-1 and Sentinel-2 data show that GMM outperforms the other reconstruction strategies tested, and leads to better detection of anomalous crop development. Moreover, the robust GMM approach is particularly useful if the dataset is corrupted by strong outliers, e.g., coming from a different crop type than the analyzed parcels.

### Robust low-rank covariance matrix estimation with missing values and application to classification problems

**A. Hippert-Ferrer (L2S, CentraleSupelec), M. N. El Korso (University Paris Nanterre), A. Breloy (University Paris Nanterre) and G. Ginolhac (Savoie Mont Blanc University)**

Missing values are inherent to real-world data sets. Statistical learning problems often require the estimation of parameters such as the mean or the covariance matrix (CM). If the data is incomplete, new estimation methodologies need to be designed depending on the data distribution and the missingness pattern (i.e. the pattern describing which values are missing with respect to the observed data). This talk considers robust CM estimation when the data is incomplete. In this perspective, classical statistical estimation methodologies are usually built upon the Gaussian assumption, whereas existing robust estimation ones assume unstructured signal models. The former can be inaccurate in real-world data sets in which heterogeneity causes heavy-tailed distributions, while the latter do not profit from the usual low-rank structure of the signal. Taking advantage of both worlds, a CM estimation procedure is designed on a robust (compound Gaussian) low-rank model by leveraging the observed-data likelihood function within an expectation-maximization (EM) algorithm. After a validation on simulated data sets with various missingness patterns, the interest of the proposed procedure is shown for CM-based classification and clustering problems with incomplete data. Investigated examples generally show higher classification accuracies with a classifier based on robust estimation compared to the one based on the Gaussian assumption and the one based on imputed data.
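For readers unfamiliar with the classical Gaussian baseline that this abstract contrasts with, here is a minimal sketch of textbook EM for the mean and covariance of Gaussian data with missing entries. It is not the robust low-rank compound Gaussian procedure presented in the talk; the function name, the iteration count and the synthetic data are assumptions made for the example.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=30):
    """Textbook EM for the mean and covariance of Gaussian data with NaNs."""
    n, p = X.shape
    mask = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # initialise from observed entries
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        X_hat = np.where(mask, 0.0, X)         # filled in row by row below
        C = np.zeros((p, p))                   # summed conditional covariances
        for i in range(n):
            m = mask[i]
            if not m.any():                    # complete row: nothing to do
                continue
            o = ~m
            if not o.any():                    # fully missing row
                X_hat[i] = mu
                C += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(S_oo, S_mo.T).T          # S_mo @ inv(S_oo)
            # E-step: conditional mean and covariance of the missing block.
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update the moments from the expected sufficient statistics.
        mu = X_hat.mean(axis=0)
        D = X_hat - mu
        sigma = (D.T @ D + C) / n
    return mu, sigma

# Hypothetical example: 3-dimensional Gaussian data, 20% of entries missing.
rng = np.random.default_rng(0)
sigma_true = np.array([[1.0, 0.8, 0.3],
                       [0.8, 1.0, 0.5],
                       [0.3, 0.5, 1.0]])
X_full = rng.multivariate_normal(np.zeros(3), sigma_true, size=1000)
X = X_full.copy()
X[rng.random(X.shape) < 0.2] = np.nan

mu_hat, sigma_hat = em_gaussian_missing(X)
max_err = float(np.abs(sigma_hat - sigma_true).max())
print(f"largest entrywise covariance error: {max_err:.3f}")
```

The E-step fills each incomplete row with its conditional mean and keeps track of the conditional covariance of the missing block; dropping that correction term would systematically underestimate the variances.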

### Hybrid generalized approximate message passing algorithm for the generalized linear model with correlated priors

**Lélio Chetot (Inria), Malcolm Egan (Inria) and Jean-Marie Gorce (Inria)**

A common problem arising in signal processing is to reconstruct a sparse signal x based on a lower dimensional observation obtained via linear mixing, often known as compressed sensing (CS). A popular family of algorithms for this problem adopts a Bayesian perspective, e.g., belief propagation, expectation propagation, and approximate message passing. One of the key challenges is the choice of an appropriate prior distribution, which often involves a latent variable model. In the case of group sparsity, each latent variable determines whether a group of elements of the signal x is non-zero, and the latent variables are assumed to be independent Bernoulli random variables.

A key limitation of the group sparse prior is that it does not capture moderate levels of correlation in the latent variables. This is a problem that arises in some applications, such as channel estimation in wireless communication systems. In this work, we address this problem by introducing a new model for the prior, where correlation in the latent variables arises from a Gaussian copula. For the resulting generalized linear model, we develop an algorithm to reconstruct the signal x via approximate message passing. In particular, our algorithm falls in the family of hybrid generalized approximate message passing (HGAMP). We show that for a scenario motivated by wireless communication of sensor data, our model for the prior and corresponding HGAMP algorithm can significantly outperform other methods exploiting the group sparse prior.

### Fast informed nonnegative matrix factorization for mobile sensor calibration

**Farouk Yahaya (Univ. littoral, LISIC), Matthieu Puigt (Univ. littoral, LISIC), Olivier Vu thanh (Univ. littoral, LISIC), Gilles Delmaire (Univ. littoral, LISIC) and Gilles Roussel (Univ. littoral, LISIC).**

Air quality is usually monitored by a sparsely sampled network of authoritative and bulky sensors. Due to their high cost, only a few monitoring stations are deployed in each large city. As a consequence, the use of miniaturized and mobile low-cost sensors to provide finer spatial and temporal coverage is being actively investigated. Unfortunately, these low-cost sensors tend to drift over time and thus require regular calibration, which cannot be done in-lab for obvious availability and cost considerations. To solve this issue, some data-driven techniques called "in-situ sensor calibration" were proposed. In particular, such a problem can be recast as an informed matrix factorization problem with missing entries, which jointly calibrates mobile low-cost sensors and can derive some air quality maps. Unfortunately, the proposed methods are slow to converge and cannot be applied to large-scale areas covered by hundreds of sensors. We thus propose several extensions of Dorffer et al., which will be introduced during this talk: (i) we extend the calibration model in Dorffer et al. to the case of arrays with cross-sensitive sensors, and (ii) we propose several fast solvers for this informed problem. These solvers follow an Expectation-Maximization framework and combine Nesterov gradient descent and accelerated structured random projections. Simulation experiments show the relevance of the proposed methods.
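To make the idea of matrix factorization with missing entries concrete, the sketch below implements generic weighted nonnegative matrix factorization with multiplicative updates, where a binary mask restricts the fit to the observed entries. It is only a baseline illustration: it does not include the informed calibration constraints of Dorffer et al., nor the Nesterov or random-projection solvers mentioned in the abstract. The function name, rank, iteration count and synthetic data are assumptions.

```python
import numpy as np

def masked_nmf(X, M, rank, n_iter=500, eps=1e-9, seed=0):
    """Weighted NMF: reduce ||M * (X - W @ H)||_F^2 subject to W, H >= 0.

    M is a 0/1 mask with 1 on observed entries; entries of X where M == 0
    are ignored (they may be NaN).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    MX = M * np.nan_to_num(X)                  # zero out unobserved entries
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
    return W, H

def observed_rel_error(X_ref, M, W, H):
    """Relative Frobenius error restricted to the observed entries."""
    return float(np.linalg.norm(M * (X_ref - W @ H)) / np.linalg.norm(M * X_ref))

# Hypothetical example: rank-2 nonnegative matrix with about 70% observed entries.
rng = np.random.default_rng(1)
X_true = rng.random((30, 2)) @ rng.random((2, 20))
M = (rng.random(X_true.shape) < 0.7).astype(float)
X_obs = np.where(M > 0, X_true, np.nan)

W, H = masked_nmf(X_obs, M, rank=2)
rel_err = observed_rel_error(X_true, M, W, H)
print(f"relative error on observed entries: {rel_err:.4f}")
```

Multiplicative updates are simple but can converge slowly on large problems, which is the kind of limitation that motivates the accelerated solvers described in the abstract.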