The MAESTRIA project (Multi-modAl Earth obServaTion Image Analysis)
aims to solve the methodological challenges related to the fully
automatic analysis of the massive amount of images acquired by Earth
Observation platforms. MAESTRIA aims to generate land-cover and
land-use descriptions at country scale, at multiple spatial resolutions
and with several sets of classes. The ultimate goal is to provide a continuum of
spatially and semantically consistent products that are relevant for
many end-users and applications. Both public policies at local or
national levels and scientific models will benefit from such kinds of
products for climate modelling, urban planning, crop monitoring or
impact assessment of surface changes.
The output of the MAESTRIA project will be two-fold: (i) methods that
address current challenges in Earth Observation image analysis; (ii)
a large range of precise and up-to-date land-cover maps available over
very large scales, at resolutions from 2 m to 100 m. Both will be made
freely available so as to stimulate research and commercial services
built upon such products.
The current PhD position is part of, and funded by, the MAESTRIA project.
The PhD work is dedicated to the fusion of heterogeneous information
coming from different satellite sensors in order to improve the
accuracy and semantic richness of the produced land cover maps.
The concept of data cube has been introduced in order to
efficiently deal with huge amounts of Earth Observation (EO)
multi-temporal data in the mono-modal case (e.g., the Landsat
archive). The data cube is the database where the images are stored
and can be queried by date, geographical location and spectral band.
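To make the mono-modal case concrete, a data cube can be sketched as a dense array with coordinate lookups. The layout, band names and `query` helper below are illustrative assumptions, not MAESTRIA code:

```python
import numpy as np

# Hypothetical mono-modal data cube: a 4-D array indexed by
# (time, band, y, x), queried by date, spectral band and pixel location.
dates = np.array(["2018-01-01", "2018-01-11", "2018-01-21"], dtype="datetime64[D]")
bands = ["B2", "B3", "B4", "B8"]  # e.g., Sentinel-2 blue/green/red/NIR
cube = np.random.default_rng(0).random((len(dates), len(bands), 100, 100))

def query(cube, date, band, row, col):
    """Return the reflectance stored for a given date, band and pixel."""
    t = int(np.searchsorted(dates, np.datetime64(date)))
    b = bands.index(band)
    return cube[t, b, row, col]

value = query(cube, "2018-01-11", "B8", 50, 50)
```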
In the case of mono-modal imagery, the data ingestion is
straightforward. In the case of multi-modal imagery, even when these
data are relatively similar (Landsat and Sentinel-2, for instance) the
current approach consists of resampling all data to the highest
resolution, either at ingestion time or at query time. When the data
are to be used for classification tasks, this strategy only allows either a
stack-and-classify approach (feature fusion) or the
independent classification of each image modality followed by decision
fusion. Both approaches limit the ability of the classifier to fully
exploit the multi-modal nature of the data.
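The two fusion strategies above can be contrasted in a toy sketch. The nearest-centroid classifier, the feature dimensions and the OR-style decision rule are illustrative assumptions, not the project's methods:

```python
import numpy as np

# Synthetic per-pixel features for two modalities of the same scene.
rng = np.random.default_rng(0)
n = 200
optical = rng.normal(size=(n, 4))   # e.g., 4 optical features per pixel
sar = rng.normal(size=(n, 2))       # e.g., 2 SAR features per pixel
labels = (optical[:, 0] + sar[:, 0] > 0).astype(int)

def nearest_centroid_predict(X, y, Xq):
    """Minimal classifier: assign each sample to the nearest class centroid."""
    centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    d = ((Xq[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# (1) Feature fusion: stack all modalities, then classify once.
stacked = np.hstack([optical, sar])
pred_fusion = nearest_centroid_predict(stacked, labels, stacked)

# (2) Decision fusion: classify each modality independently,
#     then combine the decisions (here, a simple OR rule).
pred_opt = nearest_centroid_predict(optical, labels, optical)
pred_sar = nearest_centroid_predict(sar, labels, sar)
pred_decision = ((pred_opt + pred_sar) >= 1).astype(int)
```

In neither case does the classifier see how the modalities interact at the feature-learning stage, which is the limitation the project targets.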
When data from different sensors of the same modality are available
(optical multi-spectral sensors having different spectral bands or
different spatial resolutions) super-resolution or disaggregation
techniques based on the underlying physics and sampling phenomena are
used: pan-sharpening techniques are the most widespread. These
approaches have been extended to cases where the resolution
ratios are large, such as the combination of MODIS (1 km every day) and Landsat
(30 m every 16 days). Nevertheless, these approaches are limited to
modalities sharing the same physics and are usually sensor-specific.
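As an illustration of the most widespread technique, here is a minimal Brovey pan-sharpening sketch; the random data and naive nearest-neighbour upsampling are simplifying assumptions:

```python
import numpy as np

# Brovey pan-sharpening: multispectral bands, upsampled to the panchromatic
# grid, are rescaled so that their intensity matches the high-resolution
# panchromatic band.
rng = np.random.default_rng(1)
pan = rng.random((4, 4)) + 0.1    # high-resolution panchromatic band
ms = rng.random((3, 2, 2)) + 0.1  # low-resolution R, G, B bands

ms_up = ms.repeat(2, axis=1).repeat(2, axis=2)  # naive upsampling to the pan grid
intensity = ms_up.mean(axis=0)                  # per-pixel intensity of the MS bands
sharpened = ms_up * (pan / intensity)           # Brovey ratio transform
```

By construction, the per-pixel mean of the sharpened bands equals the panchromatic image, which is what injects the high-resolution detail.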
In MAESTRIA, a new pivotal representation of the multi-modal data
will be developed in order to minimize the loss of information with
respect to the original data: a set of variables common to all
modalities, sampled at 10 m resolution with a daily revisit. Two main
approaches will be developed in parallel: (1) physical
approaches (models of the landscapes and of the measuring mechanisms)
and (2) purely statistical approaches. We
will pay special attention to the possibility of cross-pollination of
the two approaches.
The PhD work focuses on the statistical approaches and will be
carried out in coordination with the work on physical approaches.
- Master of Science (or equivalent) in Applied Mathematics, Computer Science or Machine Learning
- Good programming skills (C++, Python)
Candidates should send an e-mail to firstname.lastname@example.org containing:
1. Full CV
2. Letter of interest
3. Contact information for 2 references
The combination of similar sensors (Landsat-8/Sentinel-2) is usually performed by a simple temporal interleaving of the acquisitions in order to build a richer time series. However, this yields limited improvements with respect to using each sensor independently (Inglada et al. 2015). A small improvement can be achieved by adapting pan-sharpening techniques (Fasbender, Radoux, and Bogaert 2008). Higher resolution ratios targeting a daily revisit cycle have been studied (Gao et al. 2006) and then improved for the case of Sentinel-2 and Sentinel-3 (Yin, Inglada, and Osman 2012). Finally, the spatial context of the pixels can also be introduced in order to improve the quality of the gap-filling procedure. However, all of this amounts to reconstructing the observations of an improved sensor, that is, a better image, not a better characterization of the surfaces: images are just proxies of the physical magnitudes we are interested in.
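The temporal interleave and gap-filling described above can be sketched for a single pixel; the dates and reflectance values are invented for illustration:

```python
import numpy as np

# Merge two sensors' acquisitions into one time series, then linearly
# gap-fill one pixel's reflectance onto a regular daily grid.
dates_l8 = np.array([0, 16, 32])          # Landsat-8 acquisition days
dates_s2 = np.array([3, 8, 13, 18, 23])   # Sentinel-2 acquisition days
refl_l8 = np.array([0.20, 0.25, 0.30])
refl_s2 = np.array([0.21, 0.22, 0.24, 0.26, 0.27])

# Temporal interleave: sort the union of acquisitions by date.
dates = np.concatenate([dates_l8, dates_s2])
refl = np.concatenate([refl_l8, refl_s2])
order = np.argsort(dates)
dates, refl = dates[order], refl[order]

# Gap-fill onto a daily grid by linear interpolation.
daily = np.arange(0, 33)
refl_daily = np.interp(daily, dates, refl)
```

The result is a denser time series of the same physical quantity, not a richer characterization of the surface, which is precisely the limitation noted above.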
In fact, the observed surfaces are characterized by their bio-/geo-physical properties. There is a wealth of literature about the estimation of such variables using different types of sensors (Hirooka et al. 2015; Inoue, Sakaiya, and Wang 2014): moisture, roughness, photosynthetic activity, leaf area index, radiation absorption, above-ground biomass, etc. Models exist for the estimation of these magnitudes from different kinds of sensors, with different amounts of error.
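The idea that the same surface variable can be estimated from different sensors can be sketched with NDVI (a standard formula) and an NDVI-to-LAI relation; the exponential model and its coefficients below are made up purely for illustration, as real models are sensor- and crop-specific:

```python
import numpy as np

# Two sensors observe the same pixel with slightly different reflectances.
nir_s2, red_s2 = 0.42, 0.08    # Sentinel-2 NIR and red reflectances
nir_l8, red_l8 = 0.40, 0.09    # Landsat-8 NIR and red reflectances

def ndvi(nir, red):
    """Normalized difference vegetation index."""
    return (nir - red) / (nir + red)

def lai_from_ndvi(v, a=0.5, b=2.5):
    """Hypothetical empirical model: LAI increases with NDVI."""
    return a * np.exp(b * v)

# Both sensors yield comparable estimates of the same surface variable.
lai_s2 = lai_from_ndvi(ndvi(nir_s2, red_s2))
lai_l8 = lai_from_ndvi(ndvi(nir_l8, red_l8))
```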
The possibility of estimating the same bio-/geo-physical variable from different sensors shows that a common representation can be defined for multi-modal data. But why limit this representation to fixed variables? In the same way that hand-crafted features tend to be replaced by automatic feature extraction (in particular deep architectures), one can propose to automatically find the common optimal representation of multi-modal data.
Up to now, the automatic feature extraction of multi-modal data has been done either in separate networks for each image modality or by stacking all input data and performing feature extraction on the whole data set. The original proposition here is to base this common representation on latent variables extracted from the data. Thus, we will build a generic data cube which can be fed with varying amounts of data from the different modalities, at the highest spatial resolution (10 m) and with a daily time step.
One straightforward solution is to adopt the encoder/decoder architecture of auto-encoders: the input of the encoder is one sensor and the output of the decoder is another sensor. This is similar to what is done in automatic machine translation. Instead of using classical AEs (sparse or denoising AEs (Tao et al. 2015)), whose parameters are difficult to tune and which have convergence issues in high dimensions, Variational AEs (Kingma and Welling 2013) and their ladder extension (C. K. Sønderby et al. 2016) will be used to explore the usefulness of deeper architectures. In order to take into account the spatial context of pixels, and speckle noise in the case of SAR, the AE architecture will have to be enriched with convolutional (CNN) layers. The temporal dimension will be introduced through recurrent units (RNN) such as the GRU (Cho et al. 2014), which is itself based on the encoder-decoder paradigm. The main methodological challenge resides in developing the spatio-temporal additions to the auto-encoder architecture in a multi-modal context. Indeed, since more than two sensors are used and any combination of them can be available at any given time step, an architecture able to deal with missing input and output data will have to be developed, together with new cost functions.
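The cross-modal encoder/decoder idea can be sketched in a deliberately simplified form: linear layers and plain gradient descent in numpy, instead of the VAE machinery described above. All dimensions and the synthetic link between the two sensors are assumptions:

```python
import numpy as np

# Linear cross-modal auto-encoder: the encoder reads sensor A and the
# decoder reconstructs sensor B through a shared latent space.
rng = np.random.default_rng(0)
n, d_a, d_b, d_z = 256, 6, 4, 3          # samples, sensor dims, latent dim
X_a = rng.normal(size=(n, d_a))          # sensor A observations
X_b = X_a @ rng.normal(size=(d_a, d_b))  # sensor B views of the same surfaces

W_enc = rng.normal(size=(d_a, d_z)) * 0.1  # encoder weights
W_dec = rng.normal(size=(d_z, d_b)) * 0.1  # decoder weights
lr = 1e-2

mse_init = np.mean((X_a @ W_enc @ W_dec - X_b) ** 2)
for _ in range(500):
    Z = X_a @ W_enc              # encode sensor A into the latent space
    err = Z @ W_dec - X_b        # reconstruction error on sensor B
    # Gradient steps on the mean squared reconstruction loss
    grad_dec = Z.T @ err / n
    grad_enc = X_a.T @ (err @ W_dec.T) / n
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse_final = np.mean((X_a @ W_enc @ W_dec - X_b) ** 2)
```

The latent bottleneck (`d_z` smaller than either sensor's dimension) is what forces the network to discover a compact shared representation rather than copying one sensor into the other.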
These generative models can also be used to predict the bio-/geo-physical variables of WP1.1 and therefore provide a unified framework for the two representations of the data. Furthermore, their use in dynamical data assimilation can be explored (Walker et al. 2016).
One of the crucial issues in this work will be to provide compact or sparse representations. With this objective in mind, particular attention will be paid to finding a set of latent variables with these properties.
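One common way to obtain sparse latent codes, shown here as a generic sketch rather than the project's chosen method, is the L1 proximal operator (soft-thresholding), which zeroes out small components so that only a few latent variables stay active:

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 proximal operator: shrink values toward zero, clipping small ones."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([0.9, -0.05, 0.02, -1.3, 0.15])   # dense latent code
z_sparse = soft_threshold(z, 0.2)              # only large components survive
```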
The overall evaluation of the new algorithms will first be based on the classification accuracy of the maps obtained with the new representations on image time series covering a full country, as in (Inglada et al. 2017). An improvement is expected in areas with frequent cloud cover, thanks to the increased number of available acquisitions. Furthermore, this improvement will preserve the spatial accuracy of the sensor with the highest spatial resolution (which is not the case for existing methods in the literature). Scalability in terms of computational cost, stability of performance in the presence of missing data, and feature interpretability will also be key evaluation criteria.
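The headline evaluation metric, overall accuracy, is simply the trace of the confusion matrix between reference and predicted land-cover labels, divided by the number of pixels; the tiny label vectors below are illustrative:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each reference class is mapped to each predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])  # reference land-cover labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])  # classifier output
cm = confusion_matrix(y_true, y_pred, 3)
overall_accuracy = np.trace(cm) / cm.sum()   # fraction of correctly labelled pixels
```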
Cho, Kyunghyun, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches.” CoRR. http://arxiv.org/abs/1409.1259v2.
Fasbender, D., J. Radoux, and P. Bogaert. 2008. “Bayesian Data Fusion for Adaptable Image Pansharpening.” IEEE Transactions on Geoscience and Remote Sensing 46 (6): 1847–57. doi:10.1109/tgrs.2008.917131.
Gao, Feng, J. Masek, M. Schwaller, and F. Hall. 2006. “On the Blending of the Landsat and MODIS Surface Reflectance: Predicting Daily Landsat Surface Reflectance.” IEEE Transactions on Geoscience and Remote Sensing 44 (8): 2207–18. doi:10.1109/tgrs.2006.872081.
Hirooka, Yoshihiro, Koki Homma, Masayasu Maki, and Kosuke Sekiguchi. 2015. “Applicability of Synthetic Aperture Radar (SAR) to Evaluate Leaf Area Index (LAI) and Its Growth Rate of Rice in Farmers’ Fields in Lao PDR.” Field Crops Research 176 (May): 119–22. doi:10.1016/j.fcr.2015.02.022.
Inglada, Jordi, Marcela Arias, Benjamin Tardy, Olivier Hagolle, Silvia Valero, David Morin, Gérard Dedieu, et al. 2015. “Assessment of an Operational System for Crop Type Map Production Using High Temporal and Spatial Resolution Satellite Optical Imagery.” Remote Sensing 7 (9): 12356–79. doi:10.3390/rs70912356.
Inglada, Jordi, Arthur Vincent, Marcela Arias, Benjamin Tardy, David Morin, and Isabel Rodes. 2017. “Operational High Resolution Land Cover Map Production at the Country Scale Using Satellite Image Time Series.” Remote Sensing 9 (1): 95. doi:10.3390/rs9010095.
Inoue, Yoshio, Eiji Sakaiya, and Cuizhen Wang. 2014. “Capability of C-Band Backscattering Coefficients from High-Resolution Satellite SAR Sensors to Assess Biophysical Variables in Paddy Rice.” Remote Sensing of Environment 140 (January): 257–66. doi:10.1016/j.rse.2013.09.001.
Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” CoRR. http://arxiv.org/abs/1312.6114v10.
Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “Ladder Variational Autoencoders.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3738–46. Curran Associates, Inc. http://papers.nips.cc/paper/6275-ladder-variational-autoencoders.pdf.
Tao, Chao, Hongbo Pan, Yansheng Li, and Zhengrou Zou. 2015. “Unsupervised Spectral & Spatial Feature Learning with Stacked Sparse Autoencoder for Hyperspectral Imagery Classification.” IEEE Geoscience and Remote Sensing Letters 12 (12): 2438–42. doi:10.1109/lgrs.2015.2482520.
Walker, Jacob, Carl Doersch, Abhinav Gupta, and Martial Hebert. 2016. “An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders.” Lecture Notes in Computer Science. Springer International Publishing, 835–51. doi:10.1007/978-3-319-46478-7_51.
Yin, Tiangang, Jordi Inglada, and Julien Osman. 2012. “Time Series Image Fusion: Application and Improvement of STARFM for Land Cover Map and Production.” 2012 IEEE International Geoscience and Remote Sensing Symposium, July. doi:10.1109/igarss.2012.6351559.