Biodiversity, Networks & Data

nadler2006diffusion

Diffusion maps, spectral clustering and reaction coordinates of dynamical systems

Boaz Nadler, Stéphane Lafon, Ronald R. Coifman and Ioannis G. Kevrekidis

Applied and Computational Harmonic Analysis 21, 113-127, 2006

A central problem in data analysis is the low dimensional representation of high dimensional data and the concise description of
its underlying geometry and density. In the analysis of large scale simulations of complex dynamical systems, where the notion of
time evolution comes into play, important problems are the identification of slow variables and dynamically meaningful reaction
coordinates that capture the long time evolution of the system. In this paper we provide a unifying view of these apparently
different tasks, by considering a family of diffusion maps, defined as the embedding of complex (high dimensional) data onto a low
dimensional Euclidean space, via the eigenvectors of suitably defined random walks on the given datasets. Assuming that
the data is randomly sampled from an underlying general probability distribution \(p(x) = e^{-U(x)}\), we show that as the number
of samples goes to infinity, the eigenvectors of each diffusion map converge to the eigenfunctions of a corresponding differential
operator defined on the support of the probability distribution. Different normalizations of the Markov chain on the graph lead to
different limiting differential operators. Specifically, the normalized graph Laplacian leads to a backward Fokker–Planck operator
with an underlying potential of \(2U(x)\), best suited for spectral clustering. A different anisotropic normalization of the random walk
leads to the backward Fokker–Planck operator with the potential \(U(x)\), best suited for the analysis of the long time asymptotics
of high dimensional stochastic systems governed by a stochastic differential equation with the same potential \(U(x)\). Finally, yet
another normalization leads to the eigenfunctions of the Laplace–Beltrami (heat) operator on the manifold in which the data resides,
best suited for the analysis of the geometry of the dataset regardless of its possibly non-uniform density.
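
As a minimal sketch of the family of normalizations described in the abstract, the Python code below builds a Gaussian-kernel random walk on a point set and embeds the data with its leading nontrivial eigenvectors. The parameterization by a single exponent alpha (alpha = 0 for the normalized graph Laplacian, alpha = 1/2 for the anisotropic walk with potential \(U(x)\), alpha = 1 for the Laplace–Beltrami limit) follows the common diffusion-maps convention and is an assumption here, as are the function name diffusion_map, the kernel form \(\exp(-\|x-y\|^2/2\varepsilon)\), and the bandwidth eps. None of these is specified in the abstract, so this is an illustration rather than the authors' implementation.

```python
import numpy as np


def diffusion_map(X, eps, alpha, n_coords=2):
    """Hypothetical helper: embed the rows of X with an alpha-normalized random walk.

    alpha = 0.0 -> normalized graph Laplacian walk
    alpha = 0.5 -> anisotropic walk (potential U in the limit)
    alpha = 1.0 -> density-independent walk (Laplace-Beltrami limit)
    """
    # Pairwise squared Euclidean distances (clipped to avoid tiny negatives).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Gaussian kernel; the 2*eps scaling is an assumed convention.
    K = np.exp(-d2 / (2.0 * eps))

    # Anisotropic normalization: divide out a power of the kernel density estimate.
    q = K.sum(axis=1)
    K_alpha = K / np.outer(q**alpha, q**alpha)

    # Row sums of the renormalized kernel define the Markov matrix P = D^{-1} K_alpha.
    d = K_alpha.sum(axis=1)

    # Work with the symmetric matrix S = D^{-1/2} K_alpha D^{-1/2}, which has the
    # same eigenvalues as P and can be handled by a symmetric eigensolver.
    S = K_alpha / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are recovered as D^{-1/2} v.
    psi = vecs / np.sqrt(d)[:, None]

    # Skip the trivial constant eigenvector (eigenvalue 1); scale by eigenvalues.
    lam = vals[1:n_coords + 1]
    return lam, psi[:, 1:n_coords + 1] * lam


if __name__ == "__main__":
    # Two well-separated Gaussian blobs as toy data (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(3.0, 0.3, (40, 2))])
    for alpha in (0.0, 0.5, 1.0):
        lam, coords = diffusion_map(X, eps=0.5, alpha=alpha)
        print(alpha, lam)
```

For clustered data like the toy example, the sign of the first returned coordinate separates the two blobs, which is the sense in which the alpha = 0 walk is suited to spectral clustering. The dense eigendecomposition is chosen for clarity; for large datasets a sparse kernel and an iterative eigensolver would be the usual substitute.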