# Meeting

## Theory of deep learning

**Date:** 28-06-2021

**Location:** Videoconference

**Scientific themes:**

- A - Methods and models in signal processing
- T - Learning for signal and image analysis

Please note that, to ensure that every registered participant can access the meeting rooms, **registration for meetings is free but mandatory**.

### Registration

282 members of GdR ISIS and 262 non-members are registered for this meeting.

Room capacity: 600 people.

### Announcement

Despite the current success of deep learning, the theoretical guarantees associated with these decision models remain fragile. These questions raise major challenges for the signal and image processing community.

The goal of this workshop is to take stock of recent advances in the formal analysis of how deep neural networks operate. We invite contributions on the theoretical analysis of deep learning, centered on the following (non-exhaustive) topics:

- Generalization theory (double descent, implicit regularization, over-parameterized networks)
- Decision robustness (uncertainty, stability, Lipschitz properties)
- Model expressivity, network compression

### Invited speakers

- Mikhail Belkin, University of California San Diego
- Gérard Biau, Sorbonne Université, Sorbonne Center for Artificial Intelligence (SCAI)

### Call for contributions

Those wishing to present their work at this workshop are invited to email their proposal (title and abstract of at most 1 page) to the organizers **before May 31, 2021**.

### Organizers

- Caroline Chaux (caroline.chaux@univ-amu.fr), Université Aix-Marseille, I2M
- Valentin Emiya (valentin.emiya@lis-lab.fr), Université Aix-Marseille, LIS
- François Malgouyres (Francois.Malgouyres@math.univ-toulouse.fr), Institut de Mathématiques de Toulouse (IMT, CNRS UMR 5219)
- Nicolas Thome (nicolas.thome@cnam.fr), Cnam Paris
- Konstantin Usevich (konstantin.usevich@univ-lorraine.fr), CRAN, Nancy

### Program

- Session 1: 10h-12h
- Session 2: 16h-18h

### Abstracts of contributions

#### Existence, Stability And Scalability Of Orthogonal Convolutional Neural Networks by El-Mehdi Achour (Institut de Mathématiques de Toulouse)

Imposing orthogonal transformations between the layers of a neural network has been considered for several years now. Orthogonality facilitates learning by limiting gradient explosion and vanishing, decorrelates the features, and improves robustness. In this framework, this paper studies theoretical properties of orthogonal convolutional layers.

More precisely, we establish necessary and sufficient conditions on the layer architecture guaranteeing the existence of an orthogonal convolutional transform. These conditions show that orthogonal convolutional transforms exist for almost all architectures used in practice.

Recently, a regularization term imposing the orthogonality of convolutional layers has been proposed. We make the link between this regularization term and orthogonality measures. In doing so, we show that this regularization strategy is stable with respect to numerical and optimization errors and remains accurate when the size of the signals/images is large. This holds for both row and column orthogonality.

Finally, we confirm these theoretical results with experiments, and also empirically study the landscape of the regularization term.

This is joint work with François Malgouyres and Franck Mamalet.

#### A Neural Tangent Kernel Perspective of GANs by Jean-Yves Franceschi (Sorbonne Université, LIP6)

Generative Adversarial Networks (GANs; Goodfellow et al., 2014) have become a canonical approach to generative modeling as they produce realistic samples for numerous data types, with a plethora of variants (Wang et al., 2021). Much effort has been put into gaining a better understanding of the training process, with a particular focus on studying GAN loss functions to draw conclusions about their comparative advantages. Yet, empirical evaluations (Lucic et al., 2018; Kurach et al., 2019) have shown that different GAN formulations can yield approximately the same performance regardless of the chosen loss. This indicates that by focusing exclusively on the formal loss function, theoretical studies might not model practical settings adequately.

In particular, the fact that the discriminator is a trained neural network is not taken into account, nor are the corresponding inductive biases, which might considerably alter the generator's loss landscape. Moreover, neglecting this constraint hampers the analysis of gradient-based learning of the generator on finite training sets, since the gradient from the associated discriminator is ill-defined everywhere. These limitations thus hinder the potential of theoretical analyses to explain the empirical behaviour of GANs.

In this work, leveraging the recent developments in the theory of deep learning driven by Neural Tangent Kernels (NTKs; Jacot et al., 2018), we provide a framework of analysis for GANs that explicitly incorporates the discriminator's architecture, which comes with several advantages.

First, we prove that, in the proposed framework, under mild conditions on its architecture and its loss, the trained discriminator has strong differentiability properties; this result holds for several GAN formulations and standard architectures, thus making the generator's learning problem well-defined. This emphasizes the role of the discriminator's architecture in the trainability of GANs.

We then show how our framework can be useful to derive both theoretical and empirical analyses of standard losses and architectures. We highlight for instance links between Integral Probability Metric (IPM) based GANs and the Maximum Mean Discrepancy (MMD) given by the discriminator's NTK, or the role of the ReLU activation in GAN architectures.

This is joint work by Jean-Yves Franceschi, Emmanuel de Bézenac, Ibrahim Ayed, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari.

- Slides: cedric.cnam.fr/~thomen/recherche/ISIS/DL-Theory-21/Francesci-A_Neural_Tangent_Perspective_of_GANs__GdR_ISIS_.pdf

#### Achieving robustness in classification using optimal transport with hinge regularization by Mathieu Serrurier (IRIT)

Adversarial examples have exposed the vulnerability of deep neural networks to small local noise. It has been shown that constraining their Lipschitz constant should enhance robustness, but makes them harder to train with classical loss functions. We propose a new framework for binary classification, based on optimal transport, which integrates this Lipschitz constraint as a theoretical requirement. We propose to learn 1-Lipschitz networks using a new loss that is a hinge-regularized version of the Kantorovich-Rubinstein dual formulation for Wasserstein distance estimation. This loss function has a direct interpretation in terms of adversarial robustness, together with a certifiable robustness bound. We also prove that this hinge-regularized version is still the dual formulation of an optimal transportation problem, and has a solution. We also establish several geometrical properties of this optimal solution, and extend the approach to multi-class problems. Experiments show that the proposed approach provides the expected guarantees in terms of robustness without any significant accuracy drop. The adversarial examples, on the proposed models, visibly and meaningfully change the input, providing an explanation for the classification.
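As an illustration of the loss described above, here is a minimal sketch of a hinge-regularized Kantorovich-Rubinstein objective evaluated on precomputed scores. The function name `hkr_loss`, the weight `lam`, the `margin` parameter, and the sign convention (positive class pushed toward high scores) are assumptions made for this example, not details taken from the abstract. The sketch only evaluates the objective; it does not enforce the 1-Lipschitz constraint on the network that produces the scores.

```python
def hkr_loss(scores, labels, lam=10.0, margin=1.0):
    """Hinge-regularized Kantorovich-Rubinstein objective (illustrative sketch).

    scores: outputs f(x) of a (supposedly 1-Lipschitz) network, one per sample.
    labels: +1 or -1 for each sample.
    lam, margin: hypothetical hyperparameters chosen for this example.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # Kantorovich-Rubinstein term: minimizing it separates the mean scores
    # of the two classes (assumed sign convention).
    kr_term = sum(neg) / len(neg) - sum(pos) / len(pos)
    # Hinge term: penalizes samples whose signed score is below the margin.
    hinge_term = sum(max(0.0, margin - y * s)
                     for s, y in zip(scores, labels)) / len(scores)
    return kr_term + lam * hinge_term


# Well-separated scores satisfy the margin, so only the KR term contributes.
well_separated = hkr_loss([2.0, 1.5, -2.0, -1.0], [1, 1, -1, -1])
# The same scores with flipped labels are penalized by both terms.
mislabeled = hkr_loss([2.0, 1.5, -2.0, -1.0], [-1, -1, 1, 1])
```

In this toy setting, a larger `lam` puts more weight on the classification margin, while the first term rewards a large gap between the mean scores of the two classes.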

This is joint work by Mathieu Serrurier, Franck Mamalet, Thibaut Boissin, and Louis Bethune.

- Slides: cedric.cnam.fr/~thomen/recherche/ISIS/DL-Theory-21/Serrrurier_lipschitz_GDR_Isis-1.pdf

#### Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification by Gianni Franchi (ENSTA)

Bayesian Neural Networks (BNNs) have long been considered an ideal, yet unscalable solution for improving the robustness and the predictive uncertainty of deep neural networks. While they could capture more accurately the posterior distribution of the network parameters, most BNN approaches are either limited to small networks or rely on constraining assumptions, e.g., parameter independence. These drawbacks have favored the prominence of simple, but computationally heavy approaches such as Deep Ensembles, whose training and testing costs increase linearly with the number of networks. In this presentation, I will introduce an efficient deep BNN that can manage complex computer vision architectures, e.g., ResNet50 DeepLabV3+, and tasks, e.g., semantic segmentation, with fewer assumptions on the parameters. We achieve this by leveraging variational autoencoders (VAEs) to learn the interaction and the latent distribution of each network layer's parameters.

The approach that I will present, Latent-Posterior BNN (LP-BNN), is compatible with the recent BatchEnsemble method, leading to highly efficient (in terms of computation and memory during both training and testing) ensembles. LP-BNNs attain competitive results across multiple metrics in several challenging benchmarks for image classification, semantic segmentation, and out-of-distribution detection.

Preprint: https://arxiv.org/abs/2012.02818

#### Towards Building a Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks by Umut Simsekli (Inria)

In this talk, I will focus on the 'tail behavior' in SGD in deep learning. I will first empirically illustrate that heavy tails arise in the gradient noise (i.e., the difference between the stochastic gradient and the true gradient). Accordingly, I will propose to model the gradient noise as a heavy-tailed α-stable random vector, and then to analyze SGD as a discretization of a stochastic differential equation (SDE) driven by a stable process. As opposed to classical SDEs that are driven by a Brownian motion, SDEs driven by stable processes can incur 'jumps', which force the SDE (and its discretization) to transition from 'narrow minima' to 'wider minima', as proven by existing metastability theory and the extensions that we proved recently. These results open up a different perspective and shed more light on the view that SGD 'prefers' wide minima. In the second part of the talk, I will focus on the generalization properties of such heavy-tailed SDEs and show that the generalization error can be controlled by the Hausdorff dimension of the trajectories of the SDE, which is closely linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of 'capacity metric'. Finally, I will talk about the 'originating cause' of such heavy-tailed behavior and present theoretical results which show that heavy tails can even emerge in very sterile settings such as linear regression with i.i.d. Gaussian data.
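To give a feel for what heavy-tailed α-stable noise looks like compared with Gaussian noise, the following sketch draws symmetric α-stable samples with the Chambers-Mallows-Stuck method and measures how often they exceed a threshold. It is a scalar toy example, not the speaker's experimental setup; the function names, the threshold, the sample size, and the seed are all assumptions made for illustration. With α = 2 the sampler reduces to a Gaussian, and smaller α gives heavier tails.

```python
import math
import random


def symmetric_alpha_stable(alpha, rng):
    """One draw from a symmetric alpha-stable law, 0 < alpha <= 2
    (Chambers-Mallows-Stuck method, unit scale)."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def tail_fraction(alpha, threshold=10.0, n=20000, seed=0):
    """Fraction of n draws whose absolute value exceeds the threshold."""
    rng = random.Random(seed)
    draws = (symmetric_alpha_stable(alpha, rng) for _ in range(n))
    return sum(abs(x) > threshold for x in draws) / n


# Compare a heavy-tailed index with the Gaussian case (alpha = 2).
heavy = tail_fraction(1.2)
gaussian = tail_fraction(2.0)
```

Large draws of this kind are what produce the 'jumps' of the stable-driven SDE, whereas the Gaussian case almost never leaves a neighbourhood of the origin at this scale.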

The talk will be based on the following papers:

U. Simsekli, L. Sagun, M. Gürbüzbalaban, "A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks", ICML 2019

T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, G. Richard, "First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise", NeurIPS 2019

U. Simsekli, O. Sener, G. Deligiannidis, M. A. Erdogdu, "Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks", NeurIPS 2020

M. Gurbuzbalaban, U. Simsekli, L. Zhu, "The Heavy-Tail Phenomenon in SGD", arXiv, 2020

#### Tensor-based approaches for learning flexible neural networks by Yassine Zniyed (Université de Lorraine)

Despite excellent prediction performance, state-of-the-art neural network architectures are very large, with up to several million weights. In particular, running them on systems with limited computational capacity (embedded systems) becomes a difficult task. For this reason, several works have focused on the compression of NNs.

The most popular tensor approaches [1] for compression mainly aim at compressing the layers of convolutional networks, which can be viewed as tensors. By a canonical polyadic decomposition (CPD), they replace multidimensional convolutions by one-dimensional ones. Another direction of research is focused on relating tensor decompositions to neural networks with product units (instead of summing units) [2]; this type of representation, however, is not widely used in practice.

In this talk, we consider an entirely different approach. While keeping the traditional neural network structure (linear weights + nonlinear activation functions), we aim at adding flexibility [3] to activation functions (AFs), as opposed to the fixed AFs used conventionally. In particular, the activation functions are allowed to be different in different nodes (as opposed to fixed functions, e.g. ReLU, in conventional architectures). Such an architecture is particularly interesting thanks to the identifiability (uniqueness) theory available in the polynomial case [4]. Identifiability properties may provide insight into the functioning of these NNs and help to enforce stability of the representation.

Unlike existing methods for flexible AFs that use conventional training techniques [3], we employ an original framework developed in the nonlinear system identification community. The work of [5] showed that an architecture with one hidden flexible layer can be identified as a CPD of a Jacobian tensor. However, it is not directly applicable in the learning setup; in particular, there is no simple way to estimate the activation functions. In this work we propose a new method for compression of pretrained neural networks based on coupled matrix-tensor factorization. The proposed learning algorithm is based on a constrained alternating least squares (ALS) approach. Our method allows for a good compression of large NN layers, with a slight degradation of the classification accuracy.

References

[1] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, ICLR 2015, arXiv:1412.6553 (2015).

[2] N. Cohen, O. Sharir, and A. Shashua, On the Expressive Power of Deep Learning: A Tensor Analysis, in 29th Annual Conference on Learning Theory, New York, USA, 2016, pp. 698–728.

[3] A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, A survey on modern trainable activation functions, Neural Networks, 138 (2021), pp. 14–32.

[4] P. Comon, Y. Qi, and K. Usevich, Identifiability of an x-rank decomposition of polynomial maps, SIAM Journal on Applied Algebra and Geometry, 1 (2017), pp. 388–414.

[5] P. Dreesen, M. Ishteva, and J. Schoukens, Decoupling multivariate polynomials using first-order information and tensor decompositions, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 864–879.

[6] Y. Zniyed, K. Usevich, S. Miron, and D. Brie, Learning nonlinearities in the decoupling problem with structured CPD, in 16th IFAC Symposium on System Identification, Padova, Italy, 2021.

This is joint work by Yassine Zniyed, Konstantin Usevich, Sebastian Miron, and David Brie.

#### Fixed support matrix factorization is NP hard by Tung Le Quoc (ENS de Lyon)

In this work, we contribute to the theory of neural network optimization with sparsity constraints. Consider a classical feed-forward neural network where one would like to minimize the loss function while enforcing the sparsity of the network (i.e., the matrix at each layer contains many zeros). Current works have to deal with two sub-problems simultaneously: finding the positions of the nonzero entries (known as supports) at each layer, and finding their corresponding coefficients. While the former can be expected to be combinatorial in nature, we show that the latter is also NP-hard, even in the case of a shallow linear network (i.e., no bias, linear activation function, two layers, and training data (B, A) with B = Id). Yet, for certain families of sparsity patterns, we show that the problem becomes tractable with an efficient algorithm that can be exploited for multilayer sparse factorization.
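To make the second sub-problem concrete, it can be written as follows in the shallow linear case with B = Id. The notation (target matrix A, factors X and Y, prescribed supports I and J) and the choice of a squared Frobenius loss are assumptions made for illustration; the abstract itself does not fix them.

```latex
\min_{X \in \mathbb{R}^{m \times r},\; Y \in \mathbb{R}^{n \times r}}
\; \| A - X Y^{\top} \|_F^2
\quad \text{subject to} \quad
\operatorname{supp}(X) \subseteq I, \;\; \operatorname{supp}(Y) \subseteq J
```

Here the supports are given in advance, so only the values of the coefficients on those supports are optimized.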

This is joint work by Tung Le Quoc, Elisa Riccietti, and Rémi Gribonval.

#### Analyzing the identifiability of sparse linear networks by Léon Zheng (ENS de Lyon)

Sparsity in deep neural networks is desired for reducing the time and space complexity of the model. It can also be considered in the hope of making the learned model interpretable. In particular, such interpretability would require some kind of stability or identifiability of the parameters. Although identifiability is well understood in linear inverse problems regularized by sparsity, things are different in the case of networks with multiple sparse layers. We study here the identifiability of sparse linear neural networks with two layers, and show some consequences of the analysis for the multilayer case.

We give conditions under which the problem of factorizing a matrix into two sparse factors admits a unique solution, up to unavoidable equivalences. Our framework considers an arbitrary family of sparsity patterns, allowing us to capture more structured notions of sparsity than simply the count of nonzero entries. These conditions are shown to be related to the uniqueness of exact matrix decomposition into rank-one matrices, with sparsity constraints. Simple sufficient conditions for identifiability can be derived from this framework, which are verified for instance by the DCT and DST matrices of size N = 2^L, when enforcing N/2-sparsity by column on the left factor, and 2-sparsity by row on the right factor. Our analysis can then be extended to the multilayer case, as we give a formal proof that the DFT matrix of size 2^L admits a unique sparse factorization into L factors, when enforcing the butterfly supports as the sparsity constraints on the L factors.
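The butterfly factorization mentioned above can be checked numerically on a small case. The sketch below builds the L factors of the classical radix-2 FFT for size 2^L, each with exactly two nonzero entries per row, and compares their product with the DFT matrix. It assumes the standard convention in which the product equals the DFT matrix with its columns in bit-reversed order; the function names are hypothetical, and the code illustrates the existence of such a factorization, not the uniqueness result of the talk.

```python
import cmath


def butterfly_factors(L):
    """Return the L butterfly factors of the radix-2 FFT of size 2**L."""
    N = 2 ** L
    factors = []
    for level in range(L):
        block = N >> level          # size of each diagonal block at this level
        half = block // 2
        F = [[0j] * N for _ in range(N)]
        for start in range(0, N, block):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / block)  # twiddle factor
                top, bot = start + k, start + k + half
                F[top][top], F[top][bot] = 1, w
                F[bot][top], F[bot][bot] = 1, -w
        factors.append(F)
    return factors


def matmul(A, B):
    """Dense matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def bit_reverse(m, L):
    """Reverse the L-bit binary representation of m."""
    return int(format(m, "0{}b".format(L))[::-1], 2)


def max_error(L):
    """Largest entrywise gap between the product of the butterfly factors
    and the DFT matrix with bit-reversed columns."""
    N = 2 ** L
    factors = butterfly_factors(L)
    product = factors[0]
    for F in factors[1:]:
        product = matmul(product, F)
    return max(
        abs(product[k][m]
            - cmath.exp(-2j * cmath.pi * k * bit_reverse(m, L) / N))
        for k in range(N) for m in range(N))


error_size_8 = max_error(3)  # expected to be at floating-point precision
```

Each factor stores 2N nonzero entries instead of N^2, which is the source of the N log N cost of the fast Fourier transform.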

This is joint work with Tung Le Quoc, Rémi Gribonval, and Elisa Riccietti.

#### From classical statistics to modern machine learning (invited talk) by Mikhail Belkin (UCSD)

"A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance.

As I will discuss in my talk, this apparent contradiction is key to understanding modern machine learning. While classical methods rely on the bias-variance trade-off where the complexity of a predictor is balanced with the training error, "modern" models are best described by interpolation, where a predictor is chosen among functions that fit the training data exactly, according to a certain inductive bias. Furthermore, classical and modern models can be unified within a single "double descent" risk curve, which extends the usual U-shaped bias-variance trade-off curve beyond the point of interpolation. This understanding of model performance delineates the limits of classical analyses and opens new lines of enquiry into computational, statistical, and mathematical properties of models. A number of implications for model selection with respect to generalization and optimization will be discussed.

#### Spherical Perspective on Learning with Batch Normalization by Simon Roburin (LIGM)

Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications for optimization are yet to be fully understood. In this paper, we introduce a spherical framework to study the optimization of neural networks with BN layers from a geometric perspective. More precisely, we leverage the radial invariance of groups of parameters, such as filters for convolutional neural networks, to translate the optimization steps onto the L2 unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, we use it to derive the first effective learning rate expression of Adam. Then we show that, in the presence of BN layers, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, our analysis outlines phenomena that previous variants of Adam act on and we experimentally validate their importance in the optimization process.
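The radial invariance that this analysis relies on can be illustrated on a toy example. The sketch below uses a hypothetical loss that depends on a weight vector only through its direction, which mimics what happens to the weights feeding a BN layer, and checks two standard consequences numerically: the gradient has no radial component, and it shrinks by a factor ρ when the weights are scaled by ρ. The loss, the target direction, and the finite-difference gradient are assumptions made for this illustration and are not taken from the paper.

```python
import math


def normalized_loss(w, target=(0.6, 0.8)):
    """Toy radially invariant loss: depends on w only through w / ||w||."""
    norm = math.sqrt(sum(x * x for x in w))
    return sum((x / norm - t) ** 2 for x, t in zip(w, target))


def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        plus, minus = list(w), list(w)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


w = [1.0, 2.0]
rho = 3.0
grad_w = numerical_gradient(normalized_loss, w)
grad_scaled = numerical_gradient(normalized_loss, [rho * x for x in w])

# Radial component of the gradient: close to zero for an invariant loss.
radial_component = sum(g * x for g, x in zip(grad_w, w))
# Gradient at rho * w, multiplied by rho: close to the gradient at w.
rescaled = [rho * g for g in grad_scaled]
```

Because the gradient scales as the inverse of the weight norm, a plain gradient step with learning rate η on w moves the normalized direction roughly as a step with learning rate η / ||w||² would on the unit sphere, which is the kind of effective learning rate the spherical viewpoint makes explicit.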

This is joint work by Simon Roburin, Yann de Mont-Marin, Andrei Bursuc, Renaud Marlet, Patrick Pérez, and Mathieu Aubry.