18 September 2018

Category: Postdoctoral position

Supervisors:

- Saïd Ladjal, LTCI

Télécom Paristech

46 Rue Barrault,

75013 Paris

said.ladjal@telecom-paristech.fr

01 45 81 81 45

- Alasdair Newson, LTCI

Télécom Paristech

46 Rue Barrault,

75013 Paris

alasdair.newson@telecom-paristech.fr

01 45 81 73 82

- Guillaume Charpiat, INRIA Saclay

Guillaume.Charpiat@inria.fr

Office 2054, Claude Shannon building (also known as building 660 or "Digitéo"), Orsay

Tel: 01 69 15 39 91

Lab: LTCI

Digicosme group: Réseaux Profonds et Représentations Distribuées

Teams involved: IMAGES (LTCI), TAU (INRIA Saclay)

Duration: one year

Application deadline: 15 October 2018

This project explores the use of deep learning for image synthesis. A central component of image synthesis and restoration algorithms is knowledge about the particular properties of natural images, so that meaningful models and regularisations can be established. However, this is very difficult to achieve by “hand-designing” such models (Tikhonov regularisation, total variation, patch-based GMM models [19]), as was common until recently. A more flexible and powerful approach is to train neural networks for these restoration and synthesis tasks, in particular convolutional neural networks (CNNs). Examples of these networks include denoising networks, auto-encoders and generative adversarial networks (GANs). A common underlying theme of these networks is the idea that natural images lie on some lower-dimensional latent space. A core objective of this project is to design a convolutional neural network (CNN) which is able to capture the underlying structure of the image data which we are analysing, in a robust and generalisable manner. Several central questions appear here:

- Can the neural networks learn the underlying structure of the space on which natural images lie?
- Can the networks learn how to project both to and from that space?
- What architectures and/or regularisations are necessary to ensure good generalisation capacities of the network?
- Is it possible to fit workable, possibly parametric, probabilistic models to the latent space?

If this crucial objective is attained, a wide array of synthesis possibilities opens up, from texture synthesis and interpolation to image denoising. A specific application of this work would be image texture synthesis. In this context, we would like to create a network which is able to transform to and from a space where our data is represented in a manner which is amenable to probabilistic modelling.

The relevance of these research directions is argued in the following.

Generative networks are currently a very active research topic, owing to the spectacular image synthesis possibilities that they offer for a wide variety of complex and abstract objects. They have been used, for example, by Zhu et al. [18] and Ha and Eck [7] to produce images of specific objects with only a rough sketch as a guide to the algorithm. Once again, the key question here is how to discover the underlying manifold of these images. In the work of Zhu et al., this is done using a GAN, whose adversarial component seeks to distinguish between “real” and “false” images. Once this is achieved, it is possible to navigate in this space, interpolating between image examples. Very often, this is done by simple linear interpolation (as in Zhu et al. [18]), even though there is no guarantee that the space is indeed linear. Furthermore, there remain serious questions as to the generalisation capacity of GANs, in other words, to what extent GANs can interpolate in a region of the data space which was not observed during training. We illustrate this in Figure 1 by showing the results of a DCGAN [13] trained on a database of disks from which certain radii are absent. The DCGAN learns to produce the disks it has seen, but cannot synthesise unobserved data.
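A controlled dataset of this kind could be built roughly as follows. This is only a minimal sketch: the function names, the 32x32 image size, the soft disk edge and the held-out radius range are all illustrative assumptions, not the setup actually used for Figure 1.

```python
import numpy as np

def render_disk(radius, size=32):
    """Render a centred disk of the given radius as a size x size image in [0, 1]."""
    coords = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    dist = np.sqrt(xx**2 + yy**2)
    # Soft edge (one pixel wide) so the image varies continuously with the radius.
    return np.clip(radius + 0.5 - dist, 0.0, 1.0)

def make_disk_split(r_min=2.0, r_max=14.0, n=200, gap=(7.0, 9.0), size=32):
    """Evenly spaced radii; those falling inside `gap` are held out for testing."""
    radii = np.linspace(r_min, r_max, n)
    held_out = (radii >= gap[0]) & (radii <= gap[1])
    images = np.stack([render_disk(r, size) for r in radii])
    return (images[~held_out], radii[~held_out]), (images[held_out], radii[held_out])

# Training set: disks outside the gap. Test set: the unobserved radii.
(train_x, train_r), (test_x, test_r) = make_disk_split()
```

Because the only degree of freedom is the radius, a generative model trained on `train_x` can be checked directly on whether it produces plausible disks for the radii in `test_r`.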

An existing approach that encourages smooth latent spaces, and thus better generalisation, is the contractive autoencoder [14], which encourages robustness of the latent space to small changes in the network input. We shall first investigate such an approach in the context of image synthesis and generation. However, if we consider that the data is well parameterised by the latent space, then the output should also vary smoothly w.r.t. any small changes in the parametrisation. Thus, a first approach, similar to that of Rifai et al. [14], would be to

- Minimise the $\ell_2$ norm of the Jacobian of the network output w.r.t. the code $z$,

which specifies that the output should not change greatly when the code moves by a small amount. We note here that, while different regularisation techniques play a central role in modern architectures, it is rare to find work which analyses the generalisation capabilities of a network in a clear, controlled manner; the only criterion is the classification success rate. In the case of image synthesis, it is important that the latent space should be meaningful, so that we can rely on its generalisation capabilities and, ultimately, produce meaningful interpolations in the latent space. For this, we propose to study examples where the underlying space of the images is known and parametrisable. This approach is quite uncommon in the literature, and we consider that it will deliver new insights into generalisation and interpolation, or at the very least provide a minimum performance requirement for generative networks (i.e. correctly finding the latent space of images with known parametrisations, in a sufficiently robust manner).
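The Jacobian penalty mentioned above can be sketched on a toy decoder. The two-layer network, its sizes, the random weights and the penalty weight below are all assumptions made for illustration, and the norm is taken to be the squared Frobenius norm (the squared $\ell_2$ norm of the Jacobian's entries); the project itself may use a different architecture or norm.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, output_dim = 2, 16, 64  # hypothetical sizes

# Toy decoder D(z) = W2 tanh(W1 z + b1) + b2 with random (untrained) weights.
W1 = rng.normal(size=(hidden_dim, latent_dim))
b1 = rng.normal(size=hidden_dim)
W2 = rng.normal(size=(output_dim, hidden_dim)) / np.sqrt(hidden_dim)
b2 = rng.normal(size=output_dim)

def decode(z):
    return W2 @ np.tanh(W1 @ z + b1) + b2

def decoder_jacobian(z):
    """Analytic Jacobian dD/dz, of shape (output_dim, latent_dim)."""
    h = np.tanh(W1 @ z + b1)
    return W2 @ ((1.0 - h**2)[:, None] * W1)

def jacobian_penalty(z):
    """Squared Frobenius norm of the Jacobian of the output w.r.t. the code."""
    return float(np.sum(decoder_jacobian(z) ** 2))

def total_loss(x, z, weight=0.1):
    """Reconstruction error plus the weighted Jacobian penalty."""
    return float(np.sum((decode(z) - x) ** 2)) + weight * jacobian_penalty(z)

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, used only to check the analytic Jacobian."""
    cols = [(decode(z + eps * e) - decode(z - eps * e)) / (2 * eps)
            for e in np.eye(latent_dim)]
    return np.stack(cols, axis=1)

z0 = rng.normal(size=latent_dim)
```

In a real training loop the penalty would be differentiated with respect to the network weights by an automatic differentiation framework; the analytic form here only serves to make the regularised quantity explicit.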

If this generalisation problem is sufficiently well addressed, we shall also investigate the best approaches to interpolation itself. Some very recent work proposes alternatives to simple linear interpolation [16, 10], and we shall investigate these avenues, which are linked with finding geodesics in the latent space. However, in contrast to these approaches, we again propose to study cases where the parametrisation is known, in order to design autoencoders and interpolation techniques that provide meaningful latent-space interpolation.
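Why a known parametrisation helps can be seen on a deliberately simple toy case, sketched below. The "decoder" is assumed to be the map from an angle to a point on the unit circle; it stands in for a generative network and is not taken from the cited works.

```python
import numpy as np

def decode(theta):
    """Known parametrisation: angle -> point on the unit circle (a 2-pixel 'image')."""
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

theta_a, theta_b = 0.0, np.pi / 2
t = np.linspace(0.0, 1.0, 11)

# Interpolate in the latent (parameter) space, then decode.
latent_path = decode((1 - t) * theta_a + t * theta_b)
# Interpolate linearly between the two decoded outputs.
output_path = (1 - t)[:, None] * decode(theta_a) + t[:, None] * decode(theta_b)

def distance_to_manifold(points):
    """Distance of each point to the unit circle."""
    return np.abs(np.linalg.norm(points, axis=-1) - 1.0)
```

Both paths share the same endpoints, but only the latent path stays on the data manifold; the straight line in output space cuts across the circle, and `distance_to_manifold` measures by how much.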

If these goals are attained, we hope to obtain:

- Robust, generalisable generative networks for image synthesis;
- Meaningful interpolation in the resulting latent spaces.

An interesting application of such a network is texture synthesis, which we discuss now. This entails several additional challenges, such as modelling the latent space with a probabilistic model, and sampling textures of arbitrary sizes.

In the specific case of texture synthesis, a recent approach proposed by Gatys et al. [5] is to iteratively modify an input noise image such that the responses of the different filters of a CNN share some statistics with the responses to the example texture. The authors search for a local minimum of the following energy, starting from a random point:

$$\hat{x} := \arg\min_{x} \left\| \mathrm{Cov}(\Phi(x)) - \mathrm{Cov}(\Phi(y)) \right\|_2^2, \qquad (1)$$

where $\Phi$ is the network and $y$ is the example texture. Minimisation is achieved using a complex and tuned version of the L-BFGS algorithm. This approach provides very impressive results; however, many artefacts remain at a detailed level, and it is very slow due to the optimisation required. Ulyanov et al. [17] propose a similar, accelerated approach, but with the same visible artefacts. Another method, now popular in image synthesis, is to employ a GAN, which learns a generating and an adversarial network, to produce synthesised textures [3, 8]. However, these GANs have several well-known drawbacks, such as being difficult to train and subject to “mode collapse” [15]. They also yield relatively poor texture images [11]. Another useful network type is the auto-encoder, and more specifically the variational autoencoder (VAE) [9], which transforms the input image to a lower-dimensional space which is then modelled by a multivariate Gaussian distribution. In reality, this network only learns the parameters of the Gaussian, and the latent code is then produced by sampling from the chosen distribution, which in no way ensures that the underlying data is actually well modelled by the distribution. This poses a significant problem for texture synthesis (variability is not necessarily ensured).
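To make the energy in Eq. (1) concrete, here is a minimal sketch in which the network is replaced by a bank of random 3x3 filters followed by a ReLU. This stand-in, the filter count and the image sizes are assumptions for illustration only; Gatys et al. use the feature maps of a trained CNN.

```python
import numpy as np

def features(image, filters):
    """Toy stand-in for the network: valid 2-D correlation with each filter, then ReLU.

    image: (H, W); filters: (C, k, k). Returns an array of shape (C, H-k+1, W-k+1).
    """
    c, k, _ = filters.shape
    h, w = image.shape
    out = np.zeros((c, h - k + 1, w - k + 1))
    for i in range(k):
        for j in range(k):
            patch = image[i:i + h - k + 1, j:j + w - k + 1]
            out += filters[:, i, j][:, None, None] * patch[None]
    return np.maximum(out, 0.0)

def channel_covariance(feat):
    """Covariance between channels, taken over spatial positions: shape (C, C)."""
    flat = feat.reshape(feat.shape[0], -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    return flat @ flat.T / flat.shape[1]

def texture_energy(x, y, filters):
    """Squared Frobenius distance between the feature covariances of x and y."""
    diff = (channel_covariance(features(x, filters))
            - channel_covariance(features(y, filters)))
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 3, 3))  # hypothetical random filter bank
y = rng.random((32, 32))              # placeholder "example texture"
x = rng.random((32, 32))              # random starting point
energy = texture_energy(x, y, filters)
```

Synthesis then amounts to decreasing `energy` with respect to `x` by gradient-based optimisation, which is the slow step referred to above.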

A common flaw of all these approaches is the lack of a usable probabilistic model in the latent space. Ideally, we would like to have a network which is able to transform to and from a latent space to which we can fit a probabilistic, possibly parametric, model. If this goal is achieved, texture synthesis, and indeed a whole host of applications, can be reformulated and improved upon. We propose several approaches to address these problems:

- Use learned invertible non-linear transformations, such as in the work of Ballé et al. [2] who specifically constrain the latent space to be Gaussian.
- Employ recent work on Wasserstein GANs [1, 6], which specifically addresses the aforementioned problem of mode collapse in GANs.

A challenge to overcome in the first approach is that the network of Ballé et al. keeps the latent space at the same dimensionality as the input data, which does not encourage the network to uncover the optimal image latent space.
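In its simplest form, fitting a parametric model to the latent space means estimating a multivariate Gaussian from the codes and drawing new codes from it. The sketch below assumes placeholder codes generated synthetically; in practice they would come from a trained encoder, and the function names are hypothetical.

```python
import numpy as np

def fit_gaussian(codes):
    """Maximum-likelihood mean and covariance of latent codes of shape (n, d)."""
    mean = codes.mean(axis=0)
    centred = codes - mean
    cov = centred.T @ centred / codes.shape[0]
    return mean, cov

def sample_gaussian(mean, cov, n, rng):
    """Draw n new codes from N(mean, cov) using a Cholesky factor of cov."""
    d = mean.shape[0]
    chol = np.linalg.cholesky(cov + 1e-9 * np.eye(d))  # small jitter for stability
    return mean + rng.normal(size=(n, d)) @ chol.T

rng = np.random.default_rng(0)
# Placeholder codes with a known mean and covariance (mixing @ mixing.T).
true_mean = np.array([1.0, -2.0, 0.5])
mixing = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
codes = true_mean + rng.normal(size=(5000, 3)) @ mixing.T

mean, cov = fit_gaussian(codes)
new_codes = sample_gaussian(mean, cov, 5000, rng)
```

Such a fit is only useful if the codes really are close to Gaussian, which is precisely what a constraint such as that of Ballé et al. is meant to encourage.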

In short, our ideal goal is a network which can transform to and from a lower-dimensional latent space, as is the case with the auto-encoder architecture, but which also induces a space which is well modelled by a parametric probabilistic model, as typified by the work of Ballé et al. If this goal is achieved, texture synthesis can be carried out by sampling in the latent space. However, sampling textures robustly and at any size is in itself a challenge. For this, we propose to explore two approaches:

- Treat each channel of the latent space as a Gaussian field, apply a random phase to this signal and then regenerate the corresponding image using the inverse transformation, as in the work of Galerne et al. [4].
- Synthesise and stitch patches together in the latent space in a coherent manner. For this, the work of Raad et al. [12] concerning coherent patch tiling could be beneficial.
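The first of these two approaches can be sketched for a single channel as follows. This is a minimal illustration in the spirit of random phase textures, applied to a placeholder array rather than to a real latent channel; the function name is hypothetical and the handling of several channels is left open.

```python
import numpy as np

def random_phase_texture(channel, rng):
    """Keep the Fourier modulus of a 2-D channel and add a random phase to it."""
    h, w = channel.shape
    mean = channel.mean()
    spectrum = np.fft.fft2(channel - mean)
    # The phase of the FFT of real white noise is random and has the odd
    # (Hermitian) symmetry needed for the result to be real-valued.
    noise_phase = np.angle(np.fft.fft2(rng.normal(size=(h, w))))
    randomised = spectrum * np.exp(1j * noise_phase)
    return np.real(np.fft.ifft2(randomised)) + mean

rng = np.random.default_rng(0)
channel = rng.random((64, 64))  # placeholder for one latent-space channel
texture = random_phase_texture(channel, rng)
```

The output has the same mean and the same Fourier modulus as the input, so its second-order statistics are preserved while the spatial arrangement is re-drawn at random.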

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[2] J. Ballé, V. Laparra, and E. P. Simoncelli. Density modeling of images using a generalized normalization transformation. CoRR, abs/1511.06281, 2015.

[3] U. Bergmann, N. Jetchev, and R. Vollgraf. Learning texture manifolds with the periodic spatial GAN. arXiv preprint arXiv:1705.06566, 2017.

[4] B. Galerne, Y. Gousseau, and J.-M. Morel. Random phase textures: Theory and synthesis. IEEE Trans. Image Process., 20(1):257–267, 2011.

[5] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.

[6] A. Genevay, G. Peyré, and M. Cuturi. GAN and VAE from an optimal transport point of view. arXiv preprint arXiv:1706.01807, 2017.

[7] D. Ha and D. Eck. A neural representation of sketch drawings. CoRR, abs/1704.03477, 2017.

[8] N. Jetchev, U. Bergmann, and R. Vollgraf. Texture synthesis with spatial generative adversarial networks. arXiv preprint arXiv:1611.08207, 2016.

[9] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[10] S. Laine. Feature-based metrics for exploring the latent space of generative models. 2018.

[11] L. Raad, A. Davy, A. Desolneux, and J. Morel. A survey of exemplar-based texture synthesis. CoRR, abs/1707.07184, 2017.

[12] L. Raad, A. Desolneux, and J.-M. Morel. Conditional Gaussian Models for Texture Synthesis, pages 474–485. Springer International Publishing, Cham, 2015.

[13] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[14] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[16] H. Shao, A. Kumar, and P. T. Fletcher. The Riemannian geometry of deep generative models. arXiv preprint arXiv:1711.08014, 2017.

[17] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357, 2016.

[18] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

(c) GdR 720 ISIS - CNRS - 2011-2018.