1) Working with unlabeled data

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

# Mandatory imports...
import numpy as np
import torch
from torch import tensor
from torch.nn import Parameter

# Custom modules:
from model import Model  # Optimization blackbox
from display import plot # Simple plotting routine for measures
from torch_complex import rot # Rotations in 2D

Outline. In yesterday's session, we saw how to perform shape analysis on labeled point clouds. In practice, though, most datasets are unlabeled and we do not know a good mapping (or registration) between the source and the target.

In this notebook, we present a simple fidelity, or data attachment, term that allows us to work with unlabeled shapes. Needless to say, this approach can be used in conjunction with a set of reference landmark positions given by a feature detector: one just needs to sum the unlabeled fidelity and the standard $L^2$ distance between the landmarks.

A) Encoding shapes as measures

Intensity vs Density. In computational anatomy, we mostly work with two types of data: images (intensity volumes) and segmentation maps (density volumes, curves or surface meshes). Medical images pose specific challenges (signal drift in MRIs, huge memory footprint...) and we do not intend to address them in this introductory session.

In today's workshop, we will thus focus on the (mathematically) cleaner setting of segmentation maps and work with grayscale bitmaps $A$ and $B$ that represent densities of a specific tissue in the 2D plane. Mathematically, we can think of $A$ and $B$ as measures $\alpha$ and $\beta$, and write them as sums of atomic Dirac masses:

$$ \begin{align} \alpha~&=~\sum_{i=1}^N \alpha_i\,\delta_{x_i}, & \beta~&=~\sum_{j=1}^M \beta_j\,\delta_{y_j}. \end{align} $$

Here, $(x_i)$ and $(y_j)$ are encoded as real-valued N-by-2 and M-by-2 tensors, while $(\alpha_i)$ and $(\beta_j)$ are given as N-by-1 and M-by-1 positive vectors. For the sake of simplicity, we normalize both segmentation maps:

$$ \begin{align} \sum_{i=1}^N \alpha_i ~~=~~1~~=~~\sum_{j=1}^M \beta_j. \end{align} $$
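
The load_measure routine from the custom sampling module (used in the next cell) follows these conventions. For intuition, here is a hedged sketch of what such a routine might do; load_measure_sketch is a name introduced here, and the actual implementation may well differ:

def load_measure_sketch(fname) :
    "Hypothetical sketch: turn a grayscale bitmap into a normalized measure (weights, points)."
    A = plt.imread(fname)                      # (H,W) or (H,W,C) array of intensities
    if A.ndim == 3 : A = A[..., :3].mean(-1)   # collapse color channels to grayscale
    A = A / A.sum()                            # normalize to total mass 1
    H, W = A.shape
    ind = np.nonzero(A > 0)                    # one Dirac mass per nonzero pixel
    weights = torch.tensor(A[ind], dtype=torch.float32)   # masses α_i
    xs = (ind[1] + .5) / W                     # horizontal coordinates in [0,1]
    ys = 1. - (ind[0] + .5) / H                # vertical coordinates, flipped for display
    points = torch.tensor(np.stack((xs, ys), axis=1), dtype=torch.float32)  # N-by-2 positions x_i
    return weights, points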

In [2]:
from sampling import load_measure

# Images from the Spectral Log-Demons paper (Lombaert et al., 2013)
α = load_measure("data/heart_a.png")
β = load_measure("data/heart_b.png")
print("Number of atoms : N={} and M={}.".format(len(α[0]), len(β[0])))
Number of atoms : N=630 and M=596.

Restricting ourselves to operations that are well-defined on measures allows us to guarantee sampling and parametrization-invariance in our algorithms.

Since we're working on your laptop's CPU, we use tiny images... let's display them in the unit square:

In [3]:
plt.figure(figsize=(10,10))
plot(β, "blue", alpha=.7)
plot(α, "red", alpha=.7)
plt.axis("equal")
plt.axis([0,1,0,1])
plt.show()

Global divergences. To compare probability distributions with varying supports, the simplest method is to pick a conditionally positive, universal kernel $k:\mathbb{R}^2\rightarrow\mathbb{R}$ and to use

$$ \begin{align} \tfrac{1}{2}\|\alpha-\beta\|_k^2 ~=~ \tfrac{1}{2}\langle \alpha - \beta, k\star(\alpha-\beta)\rangle \end{align} $$

as a squared norm between the sampled measures. Here, $\star$ is the standard, linear convolution product so that for instance

$$ \begin{align} \langle\alpha, k\star\beta\rangle ~=~ \sum_{i=1}^N\sum_{j=1}^M \alpha_i\beta_j\,k(x_i-y_j) ~=~ \alpha_i^\top\,K_{x_iy_j}\,\beta_j, \end{align} $$ where $K_{x_iy_j}$ is the kernel matrix associated to $(x_i)$ and $(y_j)$.

Unfortunately, standard machine-learning kernels are unfit for the registration of sampled shapes. The Gaussian RBF kernel $\exp(-\|x-y\|^2/2\sigma^2)$, for instance, has nothing but weaknesses in this setting: its smoothness means that it is blind to high-frequency oscillations; its fast-decaying tail (a numerically compact support) means that the norm's gradient with respect to the $x_i$'s vanishes away from the support of $\beta$.

In practice (as advocated in Global divergences between measures, Feydy and Trouvé, 2018), the scale-invariant Energy Distance kernel from statistics

$$ \begin{align} k(x-y)~~=~~ -\|x-y\| \end{align} $$

provides a much more robust baseline: it is pointy, has global support and provides well-behaved gradients.

In [4]:
def scal( f, g ) :
    "Scalar product between two vectors."
    return torch.dot( f.view(-1), g.view(-1) )

def sqdistances(x_i, y_j) :
    "Matrix of squared distances, C_ij = |x_i-y_j|^2."
    return ( (x_i.unsqueeze(1) - y_j.unsqueeze(0)) ** 2).sum(2)

def distances(x_i, y_j) :
    "Matrix of distances, C_ij = |x_i-y_j|."
    return (x_i.unsqueeze(1) - y_j.unsqueeze(0)).norm(p=2,dim=2)

def fidelity(α, β) :
    "Energy Distance between two sampled probability measures."
    α_i, x_i = α
    β_j, y_j = β
    K_xx = -distances(x_i, x_i)
    K_xy = -distances(x_i, y_j)
    K_yy = -distances(y_j, y_j)
    cost = .5*scal( α_i, K_xx @ α_i ) \
         -    scal( α_i, K_xy @ β_j ) \
         + .5*scal( β_j, K_yy @ β_j )
    return cost
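
For comparison, one could swap a Gaussian kernel into the very same formula and observe the vanishing gradients discussed above. A minimal sketch, where the bandwidth σ = .1 is an arbitrary choice of ours:

def gaussian_fidelity(α, β, σ=.1) :
    "Same squared kernel norm, but with a Gaussian RBF kernel of bandwidth σ."
    α_i, x_i = α
    β_j, y_j = β
    K_xx = (-sqdistances(x_i, x_i) / (2*σ**2)).exp()
    K_xy = (-sqdistances(x_i, y_j) / (2*σ**2)).exp()
    K_yy = (-sqdistances(y_j, y_j) / (2*σ**2)).exp()
    return .5*scal( α_i, K_xx @ α_i ) \
         -    scal( α_i, K_xy @ β_j ) \
         + .5*scal( β_j, K_yy @ β_j )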

B) Rigid registration

Equipped with a parametrization-invariant fidelity, we can now register one measure onto another, using for instance the similarity group (translations, rotations and scalings):

In [5]:
class RigidRegistration(Model) :
    "Find the optimal translation, scaling and rotation."
    def __init__(self, α) :
        "Defines the parameters of a rigid deformation of α."
        super(Model, self).__init__()
        self.α, self.x = α[0].detach(), α[1].detach()
        self.λ = Parameter(tensor(  0.     )) # log-Scale
        self.θ = Parameter(tensor(  0.     )) # Angle
        self.τ = Parameter(tensor( [0.,0.] )) # Translation

    def __call__(self, t=1.) :
        # At the moment, PyTorch does not support complex numbers...
        x_t = (t*self.λ).exp()*rot(self.x, t*self.θ) + t*self.τ
        return self.α, x_t 
    
    def cost(self, target) :
        "Returns a cost to optimize."
        return fidelity( self(), target)
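
Here, rot is imported from the custom torch_complex module. Judging only from its use above (rot(x, θ) on an N-by-2 cloud), a minimal stand-in could look like the snippet below; this is an assumption, not the module's actual code:

def rot(x, θ) :
    "Rotate an N-by-2 point cloud x by an angle θ (a scalar tensor) around the origin."
    c, s = θ.cos(), θ.sin()
    R = torch.stack(( torch.stack(( c, -s )),
                      torch.stack(( s,  c )) ))  # 2-by-2 rotation matrix
    return x @ R.t()                             # points are stored along the rows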

To understand the trajectory from $\alpha$ to $\beta$, we display the model at intermediate timesteps:

In [6]:
def train_and_display(Model, source, target) :
    # Model: orbit of the source
    model = Model( source ) 
    model.fit( target ) # Fit to the target, scikit-learn like

    plt.figure(figsize=(10,10))
    plot(target, "blue", alpha=.7)
    plot(source, "purple", alpha=.7)
    for t in [.5] : plot(model(t), "green", alpha=.4)
    plot(model(), "red")
    plt.axis("equal")
    plt.axis([0,1,0,1])
    
    plt.figure(figsize=(10,3))
    plt.subplot(1,3,1)
    plot(target, "blue",   alpha=.4, scale=.1)
    plot(source, "purple", alpha=.4, scale=.1)
    plt.axis("equal")
    plt.axis([0,1,0,1])
    
    plt.subplot(1,3,2)
    plot(target,    "blue",  alpha=.4, scale=.1)
    plot(model(.5), "green", alpha=.4, scale=.1)
    plt.axis("equal")
    plt.axis([0,1,0,1])
    
    plt.subplot(1,3,3)
    plot(target, "blue", alpha=.4, scale=.1)
    plot(model(), "red", alpha=.4, scale=.1)
    plt.axis("equal")
    plt.axis([0,1,0,1])
    
    plt.show()
    
    return model

Let's see how it goes!

In [7]:
rigid = train_and_display(RigidRegistration, α, β)
It  1:0.004. It  2:0.007. It  3:0.002. It  4:0.002. It  5:0.002. 
It  6:0.001. It  7:0.001. It  8:0.001. It  9:0.001. It 10:0.001. 

b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
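
The printed message comes from SciPy's L-BFGS-B routine, which suggests that the Model black box wraps a quasi-Newton optimizer around the PyTorch gradients. As a purely hypothetical sketch (fit_lbfgs, flatten, unflatten and func are names introduced here, not the model module's API), the fitting loop could look like this:

from scipy.optimize import fmin_l_bfgs_b

def fit_lbfgs(model, target, maxiter=100) :
    "Hypothetical sketch of a fit method: L-BFGS-B on the flattened parameters."
    params = list(model.parameters())

    def flatten() :
        return np.concatenate([p.data.numpy().ravel() for p in params]).astype("float64")

    def unflatten(vec) :
        i = 0
        for p in params :
            n = p.numel()
            p.data = torch.from_numpy(vec[i:i+n]).float().view_as(p.data)
            i += n

    def func(vec) :
        "Returns the cost and its gradient at the given parameter vector."
        unflatten(vec)
        for p in params :
            if p.grad is not None : p.grad.zero_()
        cost = model.cost(target)
        cost.backward()
        grad = np.concatenate([p.grad.numpy().ravel() for p in params]).astype("float64")
        return cost.item(), grad

    it = [0]
    def callback(vec) :
        "Prints the current cost once per L-BFGS iteration."
        it[0] += 1 ; unflatten(vec)
        print("It {:2d}:{:.3f}.".format(it[0], model.cost(target).item()), end=" ")

    vec, cost, info = fmin_l_bfgs_b(func, flatten(), maxiter=maxiter, callback=callback)
    unflatten(vec)
    print("\n", info["task"])  # e.g. b'CONVERGENCE: ...'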

C) Affine registration

Rigid registration is important... but little more than a pre-processing step. In practice, we'd like to make sense of the shape variability by putting more information into the model's parameters. For instance, we could use an affine deformation

$$ \begin{align} \Phi_{A,b}~:~x\in\mathbb{R}^{2}\mapsto A x + b \in\mathbb{R}^2 \end{align} $$

parameterized by a 2-by-2 matrix $A$ and a vector $b$.

Exercise 1: Implement an affine registration module.

Hint: Remember that in PyTorch, point clouds are stored as N-by-2 tensors: each point is a row of the tensor.
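
Concretely, for a cloud x stored as an N-by-2 tensor, the map $\Phi_{A,b}$ is applied by right-multiplication with the transpose of $A$; since $A$ is a free parameter, learning $A$ or $A^\top$ makes no practical difference. A toy illustration (the tensors below are made up for the example):

x = torch.randn(5, 2)           # toy point cloud, one point per row
A = torch.tensor([[2., 0.],
                  [0., .5]])    # an arbitrary linear map
b = torch.tensor([1., -1.])     # an arbitrary translation
y = x @ A.t() + b               # same as computing A @ x_i + b for every row x_i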

In [8]:
class AffineRegistration(Model) :
    "Find the optimal affine deformation."
    def __init__(self, α) :
        "Defines the parameters of an affine deformation of α."
        super(Model, self).__init__()
        self.α_i = α[0].clone()
        self.x_i = α[1].clone()
        self.A = Parameter(torch.eye(2)   ) # Transformation
        self.B = Parameter(torch.zeros(2) ) # Translation

    def __call__(self, t=1.) :
        x_t = self.x_i @ ((1-t)*torch.eye(2)+t*self.A) + t*self.B
        return self.α_i, x_t 
    
    def cost(self, target) :
        "Returns a cost to optimize."
        return fidelity( self(), target)

affine = train_and_display(AffineRegistration, α, β)
It  1:0.512. It  2:0.389. It  3:0.056. It  4:0.046. It  5:0.037. 
It  6:0.035. It  7:0.029. It  8:0.010. It  9:0.003. It 10:0.003. 
It 11:0.002. It 12:0.001. It 13:0.001. It 14:0.001. It 15:0.001. 
It 16:0.001. It 17:0.001. It 18:0.001. It 19:0.001. 
b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'

D) Free particles

This is all well and good, but the residual between the registered source and the target still carries a lot of unexplained information. In order to capture the full shape variability, we could be tempted to use a straightforward free-particles model, updating uncorrelated displacement vectors $v_i$ and using

$$ \begin{align} \Phi_v~:~(x_i)\in\mathbb{R}^{N\times2}~\mapsto~ (x_i+v_i)\in\mathbb{R}^{N\times2}. \end{align} $$

Exercise 2: Implement a simple "free particles" registration routine. How would you describe its behavior?

In [9]:
class L2Flow(Model) :
    "Find the optimal mass positions."
    def __init__(self, α) :
        "Defines the parameters of a free deformation of α."
        super(Model, self).__init__()
        self.α_i, self.x_i = α[0].detach(), α[1].detach()
        self.v_i = Parameter(torch.zeros_like(self.x_i))

    def __call__(self, t=1.) :
        "Applies the model on some point cloud, an N-by-2 tensor."
        x_t = self.x_i + t*self.v_i
        return self.α_i, x_t 
    
    def cost(self, target) :
        "Returns a cost to optimize."
        return fidelity( self(), target)
l2flow = train_and_display(L2Flow, α, β)
It  1:0.004. It  2:0.004. It  3:0.002. It  4:0.000. It  5:0.000. 
It  6:0.000. It  7:0.000. It  8:0.000. It  9:0.000. It 10:0.000. 
It 11:0.000. It 12:0.000. It 13:0.000. It 14:0.000. It 15:0.000. 
It 16:0.000. It 17:0.000. It 18:0.000. It 19:0.000. It 20:0.000. 
It 21:0.000. It 22:0.000. It 23:0.000. It 24:0.000. It 25:0.000. 
It 26:0.000. It 27:0.000. It 28:0.000. It 29:0.000. It 30:0.000. 
It 31:0.000. 
b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'

Exercise 3: Load your own data, and investigate the properties of our registration algorithms. What are the pros and cons of the approaches presented here?
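
If you need a starting point: assuming you have two grayscale segmentation maps of your own (the file names below are placeholders), the whole pipeline can be re-run as

α = load_measure("data/my_source.png")   # placeholder file names
β = load_measure("data/my_target.png")
rigid  = train_and_display(RigidRegistration,  α, β)
affine = train_and_display(AffineRegistration, α, β)
l2flow = train_and_display(L2Flow,             α, β)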