(Author: Jean Feydy - July 2022)
Un-ordered point clouds are everywhere in applied mathematics. We use them to describe:
In order to measure reconstruction errors or perform statistics, researchers have defined many distance functions between point clouds in dimension $D$. We often call them "loss functions" in machine learning. You may already be familiar with:
There are good reasons to study these formulas. But beware: even though these quantities all satisfy the axioms of a distance (symmetry, positivity, triangle inequality...), they are not at all equivalent to each other. This notebook illustrates this counter-intuitive fact on a toy example in dimension $D = 2$.
If you would like to dive deeper into this topic, a good starting point is Chapter 3 of my PhD thesis - Geometric data analysis, beyond convolutions. It contains detailed answers to the questions that are discussed in this tutorial.
You may also be interested in the GeomLoss and KeOps libraries, which provide fast and scalable implementations of the quantities that we are going to define, both on CPU and GPU.
First, we import the standard Python libraries for array computations and plotting. Their conventions are fairly similar to those of e.g. Matlab and Julia.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
We also rely on the PyTorch library, which provides seamless automatic differentiation and GPU support.
import torch
# Shall we use the GPU?
use_cuda = torch.cuda.is_available()
device = torch.device("cuda") if use_cuda else torch.device("cpu")
# PyTorch CPU/GPU Tensor to NumPy array conversion:
numpy = lambda x: x.detach().cpu().numpy()
To keep things simple and visually assess the performance of our methods, we work with point clouds $X$ and $Y$ that we sample on the unit square:
\begin{align} X &~=~ (x_1, \dots, x_N)~\in~\mathbb{R}^{N\times 2}, & Y &~=~ (y_1, \dots, y_N)~\in~\mathbb{R}^{N\times 2}. \end{align}
from sampling import draw_samples
# Let's keep things fast by default. Feel free to increase!
N = 1000 # if use_cuda else 200
# X and Y are sampled from two densities, encoded as png images:
x_i = draw_samples("data/density_a.png", N=N, device=device)
y_j = draw_samples("data/density_b.png", N=N, device=device)
We are going to study algorithms that register the moving "source" points $x_i$ onto the fixed "target" points $y_j$. In order to track this movement, we assign a rainbow color scheme to the points $x_i$ based on their starting locations.
# Arbitrary color scheme for the x_i's:
colors_i = (10 * x_i[:, 0]).cos() * (10 * x_i[:, 1]).cos()
colors_i = numpy(colors_i)
# Just a plain blue color for the y_j's:
colors_j = [[0.55, 0.55, 0.95]]
Here is what our point clouds look like:
from sampling import draw_figure
draw_figure(
x_i=x_i,
colors_i=colors_i,
y_j=y_j,
colors_j=colors_j,
)
In order to highlight the properties of a "Loss" or "reconstruction error" between the point clouds $X$ and $Y$, we will track the behaviour of the points $x_i$ as we minimize the function:
\begin{align} \text{Loss}\big(x_1, \dots, x_N~;~ y_1, \dots, y_N\big) \end{align}
by gradient descent with respect to the moving points $x_i$. We use a fixed step size ("learning rate"):
\begin{align} \Delta t ~=~ \text{0.05} \end{align}
and compute the sequence of iterates:
\begin{align} x_i(t=0) ~&=~ x_i~, & x_i(t+\Delta t) ~&=~ x_i(t)-\Delta t\cdot N \cdot \nabla_{x_i} \text{Loss}\big(x_1(t), \dots, x_N(t)~;~ y_1, \dots, y_N\big)~. \end{align}
def gradient_flow(x_i, y_j, loss, dt=0.05):
"""
Flows along the gradient of the cost function using a simple Euler scheme.
We rely on the PyTorch automatic differentiation engine to compute the
gradient of the loss function.
Parameters
----------
x_i : (N,2) torch tensor
samples of the moving source measure
y_j : (N,2) torch tensor
samples of the fixed target measure
loss : (x_i, y_j) -> torch float number,
real-valued function
dt : float, default = .05
time step, i.e. learning rate
"""
# Parameters for the gradient descent:
Nsteps = int(5 / dt) + 1
display_its = [int(t / dt) for t in [0, 0.25, 0.50, 1.0, 2.0, 5.0]]
# Make sure that we won't modify the input measures
x_i, y_j = x_i.clone(), y_j.clone()
# We're going to perform gradient descent on loss(x_i; y_j)
# wrt. the positions x_i of the points that make up the source:
x_i.requires_grad = True
plt.figure(figsize=(12, 8))
k = 1
for i in range(Nsteps): # Euler scheme ===============
# Compute cost and gradient
l_xy = loss(x_i, y_j)
[g] = torch.autograd.grad(l_xy, [x_i])
if i in display_its: # display
ax = plt.subplot(2, 3, k)
ax.set_title(f"t = {dt*i:1.2f}")
k = k + 1
draw_figure(
x_i=x_i,
colors_i=colors_i,
grad_i=g * N,
y_j=y_j,
colors_j=colors_j,
ax=ax,
)
# in-place modification of the tensor's values
x_i.data -= dt * N * g
We understand this evolution as an ideal, model-free registration or learning problem where a source distribution $X(t)$ is iteratively fitted towards a target (empirical) distribution $Y$.
The main purpose of this tutorial is to illustrate - and discuss! - what happens to the trajectories $x_i(t)$ when we choose one loss function over another.
As a first example, let us consider a squared Euclidean norm between our two point clouds - whose ordering has been defined at random:
\begin{align} \text{Loss}\big(x_1, \dots, x_N~;~ y_1, \dots, y_N\big) ~=~ \frac{1}{2N} \sum_{i=1}^N \|x_i - y_i\|^2~. \end{align}
This corresponds to a "Hookean energy" where a small spring is attached between each of the $N$ pairs of points $(x_i, y_i)$.
def Euclidean_loss(x_i, y_j):
"""
Simplistic squared Euclidean distance between sampled point clouds,
assuming a pairwise correspondence between x_i[k] and y_j[k].
"""
return 0.5 * ((x_i - y_j) ** 2).sum(1).mean()
gradient_flow(x_i, y_j, Euclidean_loss)
Question 1: What do you think about this trajectory? Is it well-defined? What do you think about the applicability of this first method?
In order to define trajectories that do not rely on a random coupling between the points $x_i$ and $y_j$, a first idea is to associate each point $x_i$ to its closest neighbor among the $y_j$'s. This corresponds to using the reconstruction error:
\begin{align} \text{Loss}\big(x_1, \dots, x_N~;~ y_1, \dots, y_N\big) ~=~ \frac{1}{2N} \sum_{i=1}^N \min_{1 \leq j \leq N} \|x_i - y_j\|^2~. \end{align}
Unlike the Euclidean norm that we discussed earlier, this function is well-defined between un-ordered point clouds. It is closely related to the Hausdorff distance and is a core component of the Iterative Closest Point method for rigid pose estimation. Let's see how it behaves when we apply no regularization to the movement of the points $x_i$:
def squared_distances(x_i, y_j):
"""Returns the (N,N) matrix of squared distances between the points x_i and y_j.
Note that this is a *very* naive Python code, written for pedagogical purposes.
Please see www.kernel-operations.io for a scalable and (much) faster implementation.
"""
diff_xy = x_i.view(N, 1, 2) - y_j.view(1, N, 2) # (N,N,D)
return (diff_xy**2).sum(dim=2) # (N,N)
def ICP_loss(x_i, y_j):
D_xy = squared_distances(x_i, y_j)
dists_i = D_xy.min(dim=1).values
return 0.5 * dists_i.mean()
gradient_flow(x_i, y_j, ICP_loss)
Question 2: What do you think about this trajectory?
We can also try to symmetrize the function above and work with:
def ICP_loss_symmetric(x_i, y_j):
D_xy = squared_distances(x_i, y_j) # (N,N)
dists_i = D_xy.min(dim=1).values # (N,)
dists_j = D_xy.min(dim=0).values # (N,)
return 0.25 * (dists_i.mean() + dists_j.mean())
gradient_flow(x_i, y_j, ICP_loss_symmetric)
Question 3: What do you think about this trajectory? In which cases do you think that we could use this method?
Note: The functions above may seem overly simplistic and out-of-touch with real-world methods. But surprisingly, these trajectories appear as soon as we consider Gaussian Mixture Models and maximize their log-likelihoods. In many respects, the maximum likelihood method behaves as a soft nearest neighbor projection. Using such a formula without understanding it (and adding a proper regularization) can be a recipe for disaster.
This surprising fact is discussed in Section 3.2.2 of Geometric data analysis, beyond convolutions.
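To make this connection concrete, here is a minimal sketch (not part of the original notebook) of such a maximum-likelihood loss: we model the target as a uniform mixture of isotropic Gaussians of deviation sigma centered on the $y_j$'s, and penalize the negative log-likelihood of the $x_i$'s. The name GMM_loss and the $\sigma^2$ rescaling are illustrative choices; as $\sigma \to 0$, this loss converges to the ICP_loss defined above.
def GMM_loss(sigma=0.1):
    """Negative log-likelihood of the x_i's under a uniform mixture of
    isotropic Gaussians of deviation sigma centered on the y_j's.
    This is a "soft" version of ICP_loss: as sigma -> 0, the soft-min
    below converges to the nearest-neighbor projection.
    (Illustrative sketch; the name and the sigma^2 rescaling are ours.)
    """
    def loss(x_i, y_j):
        D_xy = squared_distances(x_i, y_j)  # (N,N)
        # Log of the mixture density at each point x_i, up to an additive constant:
        log_lik_i = (-D_xy / (2 * sigma**2)).logsumexp(dim=1) - np.log(N)  # (N,)
        # Rescale by sigma^2 so that -sigma^2 * logsumexp(...) ~ min_j ||x_i - y_j||^2 / 2:
        return -(sigma**2) * log_lik_i.mean()
    return loss
gradient_flow(x_i, y_j, GMM_loss(sigma=0.1))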
The iterative closest point method relied on a nearest neighbor projection. An alternative, popular method is to rely instead on convolutions.
We choose a blurring function $g:\mathbb{R}^2\rightarrow\mathbb{R}$ and consider the blurred functions:
\begin{align} (g\star X)(z) &= \frac{1}{N} \sum_{i=1}^N g(z - x_i) & & \text{and} & (g\star Y)(z) &= \frac{1}{N} \sum_{j=1}^N g(z - y_j)~. \end{align}
We can quantify the distance between these two functions defined on $\mathbb{R}^2$ by using, for instance, a squared $L^2$ norm:
\begin{align} \text{Loss}(X, Y) ~&=~ \tfrac{1}{2} \| g\star X - g\star Y\|^2_2 \\ ~&=~ \tfrac{1}{2} \iint_{z\in\mathbb{R}^2} (g\star X)^2(z) - 2 \cdot (g\star X)(z) \cdot (g\star Y)(z) + (g\star Y)^2(z)~\text{d}z~. \end{align}
Then, if we define $k = \tilde{g}\star g$, where $\tilde{g} = g \circ (x\mapsto -x)$ is the mirrored blurring function, we can show that:
\begin{align} \text{Loss}(X, Y) ~=~ \frac{1}{2 N^2} \Big[ \sum_{i,j=1}^N k(x_i,x_j) ~-~ 2 \sum_{i,j=1}^N k(x_i,y_j) ~+~ \sum_{i,j=1}^N k(y_i,y_j) \Big]~. \end{align}
We can understand this formula as the energy of a system of charged particles where the points $x_i$ carry a charge of $+1/N$, the points $y_j$ carry a charge of $-1/N$, and $k(x,y)$ is the interaction potential between two particles located at $x$ and $y$.
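If you would like to check this identity, note that (assuming that $g$ is square-integrable, so that all the integrals above are well-defined) the cross term expands as:
\begin{align} \iint_{z\in\mathbb{R}^2} (g\star X)(z)\cdot (g\star Y)(z)~\text{d}z ~=~ \frac{1}{N^2} \sum_{i,j=1}^N \iint_{z\in\mathbb{R}^2} g(z - x_i)\, g(z - y_j)~\text{d}z ~=~ \frac{1}{N^2} \sum_{i,j=1}^N (\tilde{g}\star g)(x_i - y_j)~, \end{align}
and the two squared terms follow in the same way. With a slight abuse of notation, we write $k(x_i, y_j)$ for $(\tilde{g}\star g)(x_i - y_j)$.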
The kernel function $k$ is at the heart of this method. In practice, one typically forgets about $g$ and implements the squared kernel norm as:
def kernel_loss(kernel, sigma=1.0):
def loss(x_i, y_j):
K_xy = kernel(x_i, y_j, sigma) # (N,N)
K_xx = kernel(x_i, x_i, sigma) # (N,N)
K_yy = kernel(y_j, y_j, sigma) # (N,N)
return (1 / (2 * N**2)) * (K_xx.sum() - 2 * K_xy.sum() + K_yy.sum())
return loss
A first choice is the Gaussian kernel of deviation $\sigma > 0$:
\begin{align} k(x,y) ~=~ \exp\big(-\|x-y\|^2 \,/\, 2\sigma^2\big)~. \end{align}
def gaussian_kernel(x_i, y_j, sigma=1.0):
D_xy = squared_distances(x_i / sigma, y_j / sigma) # (N,N)
return (-D_xy / 2).exp()
gradient_flow(x_i, y_j, kernel_loss(gaussian_kernel, sigma=1.0))
gradient_flow(x_i, y_j, kernel_loss(gaussian_kernel, sigma=0.5))
gradient_flow(x_i, y_j, kernel_loss(gaussian_kernel, sigma=0.1))
Another popular alternative is the exponential kernel:
\begin{align} k(x,y) ~=~ \exp\big(-\|x-y\| \,/\, \sigma \big)~. \end{align}
def exponential_kernel(x_i, y_j, sigma=1.0):
D_xy = squared_distances(x_i / sigma, y_j / sigma) # (N,N)
return (-(D_xy + 0.00001).sqrt()).exp()
gradient_flow(x_i, y_j, kernel_loss(exponential_kernel, sigma=0.2))
And finally, a robust baseline is provided by the Energy Distance kernel:
\begin{align} k(x,y) ~=~ - \| x - y\|~. \end{align}
def distance_kernel(x_i, y_j, sigma=1.0):
D_xy = squared_distances(x_i / sigma, y_j / sigma) # (N,N)
return -(D_xy + 0.00001).sqrt()
gradient_flow(x_i, y_j, kernel_loss(distance_kernel, sigma=1.0))
Question 4: Compare the behaviours of these kernel norms for different formulas and scales.
How does the smoothness of the kernel impact the result?
What about the size of its support?
Can you explain why some points "flee" the target instead of converging towards it?
These kernel norms have a long and rich history in mathematics, engineering and imaging sciences. They are known as Sobolev norms in functional analysis, Maximum Mean Discrepancies in statistics, etc. To go further on this topic, you may start with Section 3.2.3 of Geometric data analysis, beyond convolutions. It provides an overview of the main interpretations of kernel norms and many illustrations.
Finally, a third family of distances between point clouds relies on (generalized) sorting: the optimal transport problem. We define the OT cost between $X$ and $Y$ as:
\begin{align} \text{Loss}\big(x_1, \dots, x_N~;~ y_1, \dots, y_N\big) ~=~ \frac{1}{2N} \min_{\text{permutations}~\sigma} \sum_{i=1}^N \|x_i - y_{\sigma(i)}\|^2~. \end{align}
Chapter 3 of Geometric data analysis, beyond convolutions provides an accessible introduction to the topic, with numerous illustrations. The GeomLoss library also provides fast and scalable solvers for this optimization problem.
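For moderate values of $N$, this combinatorial definition can also be evaluated directly with the Hungarian algorithm, e.g. via SciPy's linear_sum_assignment. The sketch below is an illustrative addition (it assumes that SciPy is installed, and its $O(N^3)$ cost makes it impractical for large point clouds):
from scipy.optimize import linear_sum_assignment

def exact_OT_loss(x_i, y_j):
    """Exact optimal transport loss between two point clouds of equal size.
    The optimal permutation is computed on detached costs with the Hungarian
    algorithm; the returned value remains differentiable wrt. the x_i's.
    (Sketch for comparison purposes - O(N^3), unlike the Sinkhorn loop below.)
    """
    C = numpy(squared_distances(x_i, y_j))  # (N,N) cost matrix, detached
    rows, cols = linear_sum_assignment(C)   # optimal pairing x_i <-> y_{sigma(i)}
    rows = torch.as_tensor(rows, device=x_i.device)
    cols = torch.as_tensor(cols, device=x_i.device)
    return 0.5 * ((x_i[rows] - y_j[cols]) ** 2).sum(1).mean()

gradient_flow(x_i, y_j, exact_OT_loss)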
But in this tutorial, we can implement a basic yet robust solver in a few lines of code with:
def softmin(eps, C, G):
    # Log-domain soft-minimum of (C[i,j] - eps * G[j]) over the index j:
    # softmin(eps, C, G)[i] = -eps * log sum_j exp(G[j] - C[i,j] / eps).
    return -eps * (G.view(1, N) - C.view(N, N) / eps).logsumexp(dim=1)
def OT_loss(blur=0.05, n_iter=20):
def ot_loss(x_i, y_j):
# The (N,N) cost matrices:
C_xy = squared_distances(x_i, y_j) / 2
C_yx = C_xy.t()
C_xx = squared_distances(x_i, x_i) / 2
C_yy = squared_distances(y_j, y_j) / 2
# Simulated annealing heuristic (aka. epsilon-scaling in operations research):
# let the temperature decrease across iterations, from 1 (=diameter) to the target value
scales = torch.from_numpy(np.geomspace(1., blur, n_iter)).to(device=device, dtype=torch.float32)
# Initialize the dual potentials:
uni = - np.log(N) * torch.ones(N, device=device, dtype=torch.float32)
g_xy = softmin(1., C_yx, uni)
f_yx = softmin(1., C_xy, uni)
g_yy = softmin(1., C_yy, uni)
f_xx = softmin(1., C_xx, uni)
# Symmetric Sinkhorn loop, in the log-domain, with annealing:
for scale in scales:
eps = scale ** 2
ft_yx = softmin(eps, C_xy, uni + g_xy / eps) # Y -> X
gt_xy = softmin(eps, C_yx, uni + f_yx / eps) # X -> Y
ft_xx = softmin(eps, C_xx, uni + f_xx / eps) # X -> X
gt_yy = softmin(eps, C_yy, uni + g_yy / eps) # Y -> Y
# Symmetric updates:
f_yx, g_xy = 0.5 * (f_yx + ft_yx), 0.5 * (g_xy + gt_xy) # OT(X,Y)
f_xx, g_yy = 0.5 * (f_xx + ft_xx), 0.5 * (g_yy + gt_yy) # OT(X,X), OT(Y,Y)
# Final iteration to bypass the PyTorch back-propagation:
eps = blur ** 2
f_yx, g_xy = (
softmin(eps, C_xy, (uni + g_xy / eps).detach()),
softmin(eps, C_yx, (uni + f_yx / eps).detach()),
)
f_xx = softmin(eps, C_xx, (uni + f_xx / eps).detach())
g_yy = softmin(eps, C_yy, (uni + g_yy / eps).detach())
# Return the debiased dual cost:
return (f_yx - f_xx).mean() + (g_xy - g_yy).mean()
return ot_loss
gradient_flow(x_i, y_j, OT_loss(blur=0.01))
gradient_flow(x_i, y_j, OT_loss(blur=0.2))
Question 5: Try using the code above with different values of the blur parameter, which acts as a regularizer.
How does this method compare to the other loss functions?
Question 6: Re-run these experiments using different samples (you just have to load a different ".png" image at the start of the script).
For instance, try to play around with distributions that have different connected components (= "modes").
What do you think?
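For instance, here is a minimal way to re-run the last experiment with the source and target densities swapped. This sketch only re-uses the two .png files loaded at the start of this notebook; any other density image of your own would work the same way:
# Swap the two densities that were loaded at the top of the notebook:
x_i = draw_samples("data/density_b.png", N=N, device=device)
y_j = draw_samples("data/density_a.png", N=N, device=device)

# Recompute the rainbow colors of the (new) source points:
colors_i = numpy((10 * x_i[:, 0]).cos() * (10 * x_i[:, 1]).cos())

gradient_flow(x_i, y_j, OT_loss(blur=0.05))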
In this notebook, we presented three major families of "distances" between point clouds: nearest-neighbor projections, kernel norms (Maximum Mean Discrepancies) and optimal transport costs.
Surprisingly, these methods induced dramatically different trajectories for the moving points $x_i$.
So which formula should we use in practical applications?
This is the main question: it can only be answered with respect to specific applications and datasets.
Loss functions and distances are very much a part of your model: they inject prior assumptions into your results and are certainly worth a few hours of discussion!