---
title: "Normalizing diffusion kernels with optimal transport"
author: "Jean Feydy"
institute: ""
topic: ""
theme: "metropolis"
colortheme:
fonttheme:
fontsize: 11pt
linkcolor: black
urlcolor: blue
linkstyle: bold
aspectratio: 169
titlegraphic: 
logo:
date:
section-titles: false
toc: false
---



## Who am I?

Background in **mathematics** and **data science**:

```{=latex}
\vspace*{-.25cm}
\begin{description}
\setlength\itemsep{-.5em}
\item[2012--2016] ENS Paris, mathematics.
\item[2014--2015] M2 mathematics, vision, learning at ENS Cachan.
\item[2016--2019] PhD thesis in \textcolor{red!80!black}{\textbf{medical imaging}} with Alain Trouvé at ENS Cachan.
\item[2019--2021] \textcolor{red!80!black}{\textbf{Geometric deep learning}} with Michael Bronstein at Imperial College.
\item[2021+] \textcolor{red!80!black}{\textbf{Medical data analysis}} in the HeKA INRIA team (Paris).
\end{description}
\vspace*{.4cm}
```


## HeKA: a translational research team for public health

`\begin{center}`{=latex}

![](images/HeKA.pdf){height=7cm}

`\end{center}`{=latex}



## My main motivation

Develop **robust and efficient** software that $\BR{stimulates}$ **other researchers**:

1. Speed up $\BR{geometric}$ **machine learning** on GPUs: \
   $\Longrightarrow~~$ **pyKeOps** library for distance and kernel matrices, 900k+ downloads.

2. Scale up $\BR{pharmacovigilance}$ to the full French population: \
   $\Longrightarrow~~$ **survivalGPU**, a fast re-implementation of the R survival package.

3. Ease access to modern statistical $\BR{shape analysis}$: \
   $\Longrightarrow~~$ **GeomLoss**, truly scalable optimal transport in Python. \
   $\Longrightarrow~~$ **scikit-shapes**, beta release in January.



## A vessel map that preserves vessel lengths and curvatures `\cite{guillaume_vessel_maps}`{=latex}

`\begin{center}\vspace*{1cm}`{=latex}

![](images/vessel_maps/pelvis.pdf){height=6cm}

Our new visualization method, tailored to **endovascular** interventions.

`\end{center}`{=latex}



## A consistent theory of smoothness for diverse data structures

1. The $\BR{clean}$ method: **Laplacians** and heat **diffusions** \
    \

2. The $\BR{fast}$ method: **smoothing** with local averages \
    \

3. **Sinkhorn** $\BR{normalization}$ : **fast** smoothing $~\mapsto~$ **clean** diffusion \
    \

4. **Applications**


# Laplacians and heat diffusions


## The Laplace operator

`\vspace*{.5cm}`{=latex}

::: columns
:::: {.column width=25%}
`\begin{center}\vspace*{.8cm}`{=latex}

![](images/toy_graph.png){ height=6cm }

**Graph** with $5$ nodes. \
**Signals** $f$ are vectors \
$(f_0, f_1, f_2, f_3, f_4)$.

`\end{center}`{=latex}
::::
:::: {.column width=70%}
`\begin{center}`{=latex}

```{=latex}
\begin{equation*}
\|f\|^2_{\text{smooth}} ~=~ (f_0-f_1)^2 + \cdots + (f_0 - f_4)^2
\end{equation*}


\begin{equation*}
\delta ~=~
\begin{pmatrix}
+1 & -1 & & & \\
+1 & & -1 & & \\
+1 & & & -1 & \\
+1 & & & & -1
\end{pmatrix}
~~
\text{~~is~~ $n_\text{edges} \times n_\text{points}$}
\end{equation*}

\begin{equation*}
\|f\|^2_{\text{smooth}} ~=~ \|\delta f\|^2 ~=~ f^\top \underbrace{\delta^\top \delta}_\Delta f ~=~
f^\top \Delta f
\end{equation*}
```


`\end{center}`{=latex}
::::
:::
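
## Sketch: the Laplacian of the toy graph in NumPy

A minimal sketch, assuming NumPy and the 5-node star graph of the previous slide; the test signal `f` is an arbitrary choice. It checks that $f^\top \Delta f$ matches the sum of squared differences over the edges.

```python
import numpy as np

# Incidence matrix delta (4 edges x 5 nodes): node 0 is linked to nodes 1..4.
delta = np.hstack([np.ones((4, 1)), -np.eye(4)])
laplacian = delta.T @ delta  # 5 x 5 Laplacian

f = np.array([1.0, 2.0, 0.0, -1.0, 3.0])  # arbitrary test signal
energy_edges = sum((f[0] - f[j]) ** 2 for j in range(1, 5))
energy_matrix = f @ laplacian @ f  # same value as energy_edges
```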


## Properties of the Laplace operator

By construction, the $\BR{Laplacian}$ $~~\Delta = \delta^\top \delta~$
is a $n_\text{points}\times n_\text{points}$ matrix that:

```{=latex}
\begin{tabular}{rcl}
i)  & $\Delta^\intercal = \Delta$ & is \textbf{symmetric} \\
ii) & $\Delta \mathbf{1} = 0$ & cancels \textbf{constant} signals  \\
iii) & $f^\top \Delta f \geqslant 0$  & is a \textbf{positive} operator \\
iv) & $i\neq j \Longrightarrow \Delta_{ij}\leqslant 0$ & has non-positive \textbf{off-diagonal} coefficients \\
\end{tabular}
```

`\begin{center}`{=latex}

 \
This can be $\BR{generalized}$ to weighted, continuous domains \
with e.g. the **cotan** Laplacian on triangle meshes.

`\end{center}`{=latex}
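
## Sketch: checking the four properties numerically

The properties i) to iv) can be verified on the toy star graph. This is an illustrative check, assuming NumPy; the tolerance `1e-12` only absorbs floating-point rounding.

```python
import numpy as np

delta = np.hstack([np.ones((4, 1)), -np.eye(4)])  # star graph: 4 edges x 5 nodes
laplacian = delta.T @ delta

off_diagonal = laplacian - np.diag(np.diag(laplacian))
checks = {
    "symmetric": np.allclose(laplacian, laplacian.T),
    "cancels constants": np.allclose(laplacian @ np.ones(5), 0.0),
    "positive": bool(np.linalg.eigvalsh(laplacian).min() > -1e-12),
    "non-positive off-diagonal": bool((off_diagonal <= 0).all()),
}
```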


## Regularizing signals with heat diffusion

To **regularize** a signal $f$, we may compute:

```{=latex}
\vspace*{-.7cm}
\begin{align*}
\text{Regularize}(f) &= \arg \min_g ~\|f - g\|^2 ~+~ g^\top \Delta g \\
&= (I + \Delta)^{-1} f ~~\simeq~~ e^{-\Delta} f
\end{align*}
```

"Regularize" corresponds to a **linear** $\BR{diffusion}$ **operator** $Q$ that:

```{=latex}
\begin{tabular}{rcl}
i)  & $Q^\intercal = Q$ & is \textbf{symmetric} \\
ii) & $Q \mathbf{1} = \mathbf{1}$ & preserves \textbf{constant} signals  \\
iii) & $\text{Spectrum}(Q) \subset [0, 1]$  & is a \textbf{damping} operator \\
iv) & $f \geqslant 0 \Longrightarrow Q f \geqslant 0$ & has \textbf{non-negative} coefficients \\
\end{tabular}
```

Diffusion $\BR{preserves the mass}$ of input signals: 
$$ \langle \mathbf{1}, Qf\rangle ~=~ \langle Q \mathbf{1}, f \rangle ~=~ \langle \mathbf{1}, f \rangle $$
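
## Sketch: the diffusion $Q = (I + \Delta)^{-1}$ on the toy graph

A small check, assuming NumPy and the 5-node star graph. The dense inverse is only reasonable at this toy size; the input signal `f` is an arbitrary choice.

```python
import numpy as np

delta = np.hstack([np.ones((4, 1)), -np.eye(4)])  # star graph: 4 edges x 5 nodes
laplacian = delta.T @ delta
Q = np.linalg.inv(np.eye(5) + laplacian)  # dense inverse, fine for 5 nodes only

f = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # unit mass on the central node
g = Q @ f                                 # regularized signal

# g is a critical point of ||f - g||^2 + g^T Delta g (half of the gradient).
half_gradient = (g - f) + laplacian @ g
spectrum = np.linalg.eigvalsh(Q)          # lies in (0, 1]
mass_in, mass_out = f.sum(), g.sum()      # equal, since Q is symmetric and Q1 = 1
```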



## DiffusionNet `\cite{sharp2022diffusionnet}`{=latex}

`\begin{center}`{=latex}
![](images/DiffusionNet.png){height=6cm}

Super neat, but requires a **pre-factorization** of $\Delta$. \
Not GPU or real-time friendly.
`\end{center}`{=latex}


# Smoothing with local averages

## Graph convolutions and smoothing with adjacency matrices

`\vspace*{.5cm}`{=latex}

::: columns
:::: {.column width=25%}
`\begin{center}\vspace*{.8cm}`{=latex}

![](images/toy_graph.png){ height=6cm }

**Graph** with $5$ nodes. \
**Signals** $f$ are vectors \
$(f_0, f_1, f_2, f_3, f_4)$.

`\end{center}`{=latex}
::::
:::: {.column width=70%}


```{=latex}
\begin{equation*}
S~=~ \text{Degree} ~+~ \text{Adjacency} ~=~
\begin{pmatrix}
4 & 1 & 1 & 1 & 1 \\
1 & 1 & & & \\
1 & & 1 & & \\
1 & & & 1 & \\
1 & & & & 1
\end{pmatrix}
\end{equation*}
```


$S$ is a $\BR{smoothing}$ matrix that:
```{=latex}
\begin{tabular}{rcl}
i)  & $S^\intercal = S$ & is \textbf{symmetric} \\
ii) & $f^\top S f \geqslant 0$  & is a \textbf{positive} operator \\
iii) & $f \geqslant 0 \Longrightarrow S f \geqslant 0$ & has \textbf{non-negative} coefficients \\
\end{tabular}
```

 \
But $S$ **does** $\BR{not}$ **really behave** like a diffusion: $S\mathbf{1} \neq \mathbf{1}$.


::::
:::
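
## Sketch: the smoothing matrix $S$ in NumPy

An illustrative construction of $S$ for the 5-node star graph, assuming NumPy. It shows that the row sums differ from $1$, so constant signals are not preserved.

```python
import numpy as np

adjacency = np.zeros((5, 5))
adjacency[0, 1:] = adjacency[1:, 0] = 1.0   # node 0 is linked to nodes 1..4
degree = np.diag(adjacency.sum(axis=1))
S = degree + adjacency

row_sums = S @ np.ones(5)                     # (8, 2, 2, 2, 2), not (1, ..., 1)
min_eigenvalue = np.linalg.eigvalsh(S).min()  # non-negative up to rounding
```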


## How to normalize our smoothing matrix $S$ to recover a well-behaved diffusion $Q$?

1. **Row normalization.** Use $Q = \operatorname{diag}(S\mathbf{1})^{-1} S$. \
   This guarantees the $\BR{preservation of constants}$: $Q\mathbf{1} = \mathbf{1}$. \
   Problem: we lose symmetry,  $Q^\top \neq Q$, $~\langle \mathbf{1}, Qf \rangle \neq \langle \mathbf{1}, f\rangle$.

2. **Symmetric normalization.** Use $Q = \operatorname{diag}(S\mathbf{1})^{-1/2} \, S \, \operatorname{diag}(S\mathbf{1})^{-1/2}$. \
   This guarantees $\BR{symmetry}$: $Q^\top = Q$. \
   Problem: we do not preserve constants, $Q \mathbf{1} \neq \mathbf{1}$.

3. **Sinkhorn normalization.** Iterate step 2! \
   This converges quickly to a diagonal matrix $\Lambda > 0$ such that
   $Q\mathbf{1} = \Lambda S \Lambda \mathbf{1} = \mathbf{1}$. \
   We guarantee $\BR{both}$ **symmetry** and the **preservation of constants**. 

$\BR{Sinkhorn}$ turns **any smoothing matrix** $S > 0$ 
into a **genuine diffusion operator**.
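
## Sketch: the three normalizations on the toy graph

A minimal sketch, assuming NumPy and the star-graph matrix $S$. Repeating step 2 on its own output is written here as the update `lam = sqrt(lam / (S @ lam))` on the diagonal of $\Lambda$; the algorithm shown in the talk may differ in its details, and 100 iterations is an arbitrary, generous budget.

```python
import numpy as np

S = np.diag([4.0, 1.0, 1.0, 1.0, 1.0])   # Degree + Adjacency, star graph
S[0, 1:] = S[1:, 0] = 1.0
ones = np.ones(5)

# 1. Row normalization: constants are preserved, symmetry is lost.
Q_row = S / (S @ ones)[:, None]

# 2. Symmetric normalization: symmetry is kept, constants are not preserved.
d = (S @ ones) ** -0.5
Q_sym = d[:, None] * S * d[None, :]

# 3. Sinkhorn normalization: iterate step 2 (the first pass gives lam = d).
lam = np.ones(5)
for _ in range(100):
    lam = np.sqrt(lam / (S @ lam))
Q_sink = lam[:, None] * S * lam[None, :]  # symmetric, with unit row sums
```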


## A simple and versatile algorithm

`\begin{center}`{=latex}
![](images/algorithm.png){height=6cm}
`\end{center}`{=latex}


::: columns
:::: {.column width=65%}

 \
This is $\BR{much cheaper}$ and more **general** than using:

- $Q = (I + \Delta)^{-1}$ via a direct solver \
  or a sparse Cholesky decomposition.
- $Q = e^{-\Delta}$ via a truncated eigendecomposition.

::::
:::: {.column width=32%}
`\begin{center}`{=latex}
![](images/runtimes.png){height=5cm}
`\end{center}`{=latex}
::::
:::


# Sinkhorn normalization

## Fast convergence: monitoring the average value of $|Q\mathbf{1} - \mathbf{1}|$

`\vspace*{.5cm}`{=latex}

::: columns
:::: {.column width=40%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/graphs_random.pdf){ height=6cm }

Random graph with $n$ nodes.

`\end{center}`{=latex}
::::
:::: {.column width=55%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/convergence_graphs_random.pdf){ height=6cm }

After 5 iterations: 0.01% error.
`\end{center}`{=latex}
::::
:::


## Fast convergence: monitoring the average value of $|Q\mathbf{1} - \mathbf{1}|$

`\vspace*{.5cm}`{=latex}

::: columns
:::: {.column width=40%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/graphs_geometric.pdf){ height=6cm }

Geometric graph with $n$ nodes.

`\end{center}`{=latex}
::::
:::: {.column width=55%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/convergence_graphs_geometric.pdf){ height=6cm }

After 5 iterations: 0.5% error.
`\end{center}`{=latex}
::::
:::


## Fast convergence: monitoring the average value of $|Q\mathbf{1} - \mathbf{1}|$

`\vspace*{.5cm}`{=latex}

::: columns
:::: {.column width=40%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/armadillo.jpg){ height=6cm }

Armadillo surface -- 5,000 points.

`\end{center}`{=latex}
::::
:::: {.column width=55%}
`\begin{center}`{=latex}

![](images/sinkhorn_convergence/convergence_graphs_armadillo.pdf){ height=6cm }

After 5 iterations: 0.1% error.
`\end{center}`{=latex}
::::
:::
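
## Sketch: monitoring the convergence criterion

The error plotted on the previous slides can be tracked as follows. This is an illustrative sketch, assuming NumPy; the helper `sinkhorn_errors` is a hypothetical name, and it runs on the 5-node toy star graph, not on the graphs or the mesh of the figures.

```python
import numpy as np

def sinkhorn_errors(S, n_iter):
    """Mean value of |Q1 - 1| after each symmetric scaling update."""
    lam, errors = np.ones(S.shape[0]), []
    for _ in range(n_iter):
        lam = np.sqrt(lam / (S @ lam))   # symmetric scaling update
        Q_ones = lam * (S @ lam)         # Q 1, with Q = diag(lam) S diag(lam)
        errors.append(np.abs(Q_ones - 1).mean())
    return errors

S = np.diag([4.0, 1.0, 1.0, 1.0, 1.0])   # Degree + Adjacency, star graph
S[0, 1:] = S[1:, 0] = 1.0
errors = sinkhorn_errors(S, n_iter=60)
```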

## Fixing the "central node bias"

`\vspace*{1cm}`{=latex}

`\begin{center}`{=latex}

![](images/graph_toy_diffusion.png){ height=6cm }

`\end{center}`{=latex}


## Well-posedness

Focus on **convolution** with a Gaussian or exponential kernel on $\mathbb{R}^d$:

$$
S_\mu f ~:~ x_i ~\mapsto~ \sum_j k(x_i, x_j) m_j f(x_j)
$$

We can interpret the diagonal **scaling** matrix $\Lambda$ for $Q_\mu = \Lambda S_\mu \Lambda$ \
as pointwise $\BR{multiplication}$ with a positive, **continuous** function $\lambda$.

Using standard lemmas from optimal transport theory, we show that:
$$
\mu^t \rightharpoonup \mu 
~~\Longrightarrow~~ 
\lambda^t \xrightarrow{\|\cdot\|_\infty} \lambda
~~\Longrightarrow~~ 
Q_{\mu^t} f = \lambda^t S_{\mu^t} \lambda^t f \xrightarrow{\|\cdot\|_\infty} \lambda S_\mu \lambda f = Q_\mu f~.
$$
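
## Sketch: normalizing a Gaussian convolution on a point cloud

A minimal sketch of $Q_\mu = \Lambda S_\mu \Lambda$ for a discrete measure $\mu = \sum_j m_j \delta_{x_j}$, assuming NumPy. The random points, the uniform masses, the scale `sigma` and the iteration count are all arbitrary choices, and the dense kernel matrix only suits small point clouds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))      # sample points x_i (hypothetical data)
m = np.full(n, 1.0 / n)          # masses m_i of the discrete measure mu
sigma = 0.5                      # kernel scale, arbitrary choice

sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))   # Gaussian kernel k(x_i, x_j)

def S_mu(f):
    return K @ (m * f)           # sum_j k(x_i, x_j) m_j f(x_j)

lam = np.ones(n)
for _ in range(200):
    lam = np.sqrt(lam / S_mu(lam))   # symmetric scaling update

def Q_mu(f):
    return lam * S_mu(lam * f)   # Q_mu = Lambda S_mu Lambda

f = rng.normal(size=n)
mass_before, mass_after = m @ f, m @ Q_mu(f)   # mu-weighted masses
```

Here, symmetry and mass preservation hold with respect to the $\mu$-weighted inner product $\langle f, g \rangle_\mu = \sum_i m_i f_i g_i$.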


## Spectral convergence -- 10th eigenvector on the Armadillo


`\vspace*{-.5cm}`{=latex}

`\begin{center}`{=latex}

![](images/spectrum_armadillo.pdf){ height=100% }

`\end{center}`{=latex}


## Gradient flows -- without regularization

`\vspace*{.25cm}`{=latex}

`\begin{center}`{=latex}

![](images/gradients_classic.pdf){ height=100% }

Wasserstein gradient flow of the Energy Distance.

`\end{center}`{=latex}

## Gradient flows -- with a Gaussian regularization at scale $\sigma = 0.07$

`\vspace*{0cm}`{=latex}

`\begin{center}`{=latex}

![](images/gradients_007.pdf){ height=100% }

`\end{center}`{=latex}


## Shape metrics -- geodesic interpolation and extrapolation


`\vspace*{.5cm}`{=latex}

`\begin{center}`{=latex}

![](images/geodesic.pdf){ height=100% }

Normalizing LDDMM kernel metrics fixes the "**exploding geodesics**" problem. \
We obtain a versatile and topology-preserving metric for shape analysis.

`\end{center}`{=latex}


## Use a normalized Gaussian convolution instead of a pre-factored Laplacian

`\vspace*{.25cm}`{=latex}

`\begin{center}`{=latex}

![](images/QDiffNet.png){ height=100% }

`\end{center}`{=latex}

## Conclusion

We used to face $\BR{dilemmas}$:

- $\BR{Smooth}$ with **Laplacians** (expensive) or with **local averages** (biased).

- $\BR{Normalize}$ operators with **row-wise** or **symmetric** scaling.


A simple trick -- $\BR{iterate}$ the symmetric scaling update:

- Cheap and **versatile**.
  
- Turn $\BR{convolutions}$ into genuine **diffusion** operators.
  
- Fix the **"central node bias"**.


`\begin{center}`{=latex}

Non-intrusive method to enforce theoretical axioms. \
Ideally suited to modern parametric models.

`\end{center}`{=latex}



# References