The 21st century is characterized by the rise of massive datasets that are not only large in sheer size but also complex due to the breadth of information they capture. Many well-established data analysis methods fail on these novel datasets due to the so-called curse of dimensionality. Diffusion maps, a method proposed by Coifman et al. [1,2,3], are a powerful tool to reduce this dimensionality by identifying the most important dimensions in a system, which emerge as nonlinear functions of the primary data.
In contrast to linear methods such as principal component analysis (PCA), the diffusion map is a fully nonlinear method. In contrast to many superficially similar methods (e.g. neighborhood embeddings), the diffusion map focusses on capturing the large-scale structure of datasets. It benefits from a strong physical intuition and is fully deterministic and (in our formulation) almost parameter-free.
Diffusion maps are a broadly applicable approach, but many of their early applications were in computer science and particularly in computer vision. At the BioND lab we have focused on carrying diffusion maps into new applications to create social, ecological and industrial impact.
The diffusion map identifies new variables that describe the census output areas (COAs) and ranks them by importance. While the method does not provide an interpretation for these variables, some of them can be interpreted easily. We identified the most important variable as the "university student density". That this variable shows up as the most important emergent variable in the census (i.e. the factor that most shapes the census) is partly due to biases in the census questions: students answer many census questions in the same way; they live with other people they are not related to, they have A-levels but no university degree, and so on.
We identified the second variable as economic deprivation. Comparing the distribution of this variable with a map of former social housing projects (council estates) yields a disturbingly accurate match. This is remarkable because the British Census does not contain income data. Instead, the deprivation signal detected by the diffusion map correlates strongly with more than 600 census statistics, and thus picks up on the distinct living circumstances of people in economic deprivation.
Besides photosynthesis we find other variables that encode resource utilization strategies such as sulphur reduction. Others include defenses such as alcohol and acid production, or cold and antibiotic resistance, and others represent adaptations to a specific habitat, such as the production of biopolymers in the oceans or the production of specific molecules that are exchanged with plant and animal symbionts.
However, we show that functional diversity can be robustly quantified even without knowing the traits. Because traits ultimately determine the biogeography of a species, information on the traits is contained in the biogeographical distribution. This gives us an idea: we consider two species to be similar if they are distributed in a similar way among the samples. This naive notion of similarity has many problems, but these problems can be largely fixed by running the naive similarities through the diffusion map. The result is a method that can robustly quantify the functional diversity observed in a monitoring dataset [6].
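To make this concrete, here is a minimal sketch of the idea in Python; the random abundance matrix, the cosine similarity, and the 10-neighbor cutoff are our illustrative assumptions, not details of the published method [6]:

```python
import numpy as np

# Hypothetical monitoring data: abundance of 50 species in 200 samples.
rng = np.random.default_rng(0)
abundance = rng.random((50, 200))

# Naive similarity: two species are similar if they are distributed in
# a similar way among the samples (here: cosine similarity of rows).
unit = abundance / np.linalg.norm(abundance, axis=1, keepdims=True)
similarity = unit @ unit.T

# Keep only each species' 10 strongest links, discarding the unreliable
# comparisons between very dissimilar species.
k = 10
mask = np.zeros_like(similarity, dtype=bool)
idx = np.argsort(similarity, axis=1)[:, -(k + 1):]
np.put_along_axis(mask, idx, True, axis=1)
W = np.where(mask | mask.T, similarity, 0.0)  # symmetrized similarity graph
np.fill_diagonal(W, 0.0)

# Running the naive similarities through the diffusion map: the leading
# Laplacian eigenvectors of this graph span a trait-like space in which
# functional diversity can be quantified.
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
coords = eigvecs[:, 1:4]  # skip the trivial zero eigenvalue
```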
To solve this problem we used the diffusion map to embed the sockets and the lights separately in a low-dimensional (2- or 3-dimensional) space. The matching between lights and sockets is then performed in this diffusion space, which yields 100% matching accuracy under realistic conditions and hence performs better and faster than previous methods.
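As a hedged illustration of the matching step only, assuming the two diffusion embeddings have already been computed and aligned (separately computed eigenvectors carry an arbitrary sign that would have to be resolved first), one could pair the two point sets with the Hungarian algorithm from SciPy; the array names and the nearly overlapping toy data are our invention, not necessarily the exact pipeline:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Toy diffusion-space coordinates: 30 sockets and 30 lights in 3D.
rng = np.random.default_rng(1)
socket_coords = rng.random((30, 3))
light_coords = socket_coords + 0.01 * rng.standard_normal((30, 3))

# Pairwise distances in diffusion space ...
cost = cdist(socket_coords, light_coords)

# ... and the optimal one-to-one matching between lights and sockets.
socket_idx, light_idx = linear_sum_assignment(cost)
```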
The curse of dimensionality appears because most pairwise comparisons that we make in a high-dimensional space compare very disparate objects, and disparate objects are hard to compare: even in our daily lives we find it much easier to compare two songs or two paintings than to compare a song to a painting. If we make comparisons in a high-dimensional data space we are not just comparing apples and oranges; we might end up comparing an orange to world peace and the color purple.
Fortunately, real data points are not randomly distributed in the high-dimensional space; they typically lie on fairly thin surfaces called manifolds. To illustrate this, consider motor vehicles. There are millions of different measurements that one could take on a car, ranging from physical dimensions via the rubber composition of the tires to the absorption spectrum of the paint job. So we could think that vehicles live in their own high-dimensional data space. But suppose I mentioned that I am thinking of a vehicle that has a gross weight of 40 tons and is 18 m long. Just specifying these two dimensions gives you a good idea what kind of vehicle I am talking about. This is how manifolds work: our data points are not all over the place, most of them are on certain narrow leaves or spines, and once we know a few statistics we have a good idea which leaf we are talking about, and almost all other properties of the data point can be guessed with good confidence.
The way to dispel the curse of dimensionality is to only allow comparisons between data points that are so close together that we can compare them faithfully. Since we often don't know up to what distance comparisons remain faithful, we allow every point to be compared with its 10 nearest neighbors.
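As a minimal sketch of this neighborhood restriction in Python (using scikit-learn's `kneighbors_graph`; the Euclidean metric and the random toy data are our assumptions):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Hypothetical data: 500 points in a 100-dimensional space.
rng = np.random.default_rng(0)
X = rng.random((500, 100))

# Allow each point to be compared only with its 10 nearest neighbors.
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")

# Nearest-neighbor relations are not symmetric; make the comparisons
# mutual so the result is an undirected network of allowed comparisons.
A = ((A + A.T) > 0).astype(float)
```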
We can now imagine the allowed comparisons as links in a network of data points. Establishing this network enables us to make comparisons between distant points by measuring the distance along the network of allowed comparisons. The diffusion map quantifies the distances between distant points in terms of a diffusion distance, which can be conveniently computed from the eigenvectors of the Laplacian matrix of the network.
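In one common convention (normalizations vary across the literature [1,2,3]), the diffusion distance between nodes \(i\) and \(j\) after diffusion time \(t\) can be written in terms of the eigenvalues \(\lambda_k\) and eigenvectors \(\psi_k\) of the Laplacian as
\[ D_t(i,j)^2 = \sum_k e^{-2\lambda_k t}\,\bigl(\psi_k(i)-\psi_k(j)\bigr)^2, \]
so at long times the eigenvectors belonging to the smallest non-zero eigenvalues dominate the distance.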
The Laplacian eigenvectors also have a meaning in their own right. If we have \(N\) data points to begin with, each Laplacian eigenvector has \(N\) elements, so each element of the vector corresponds to one of the nodes. We can say that each eigenvector assigns a new variable to each of the nodes. Since we have \(N\) such eigenvectors, we get a set of \(N\) new variables for every node. In our parameterization the most important of these variables is the one that corresponds to the smallest Laplacian eigenvalue that is not zero. The second one comes from the second-smallest non-zero eigenvalue, and so on.
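A minimal sketch of this ranking, rebuilding the neighborhood network from the sketch above (the unnormalized Laplacian is our simplification; the original papers [1,2,3] use normalized variants):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Rebuild the network of allowed comparisons from the earlier sketch.
rng = np.random.default_rng(0)
X = rng.random((500, 100))
A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
A = np.asarray(((A + A.T) > 0).todense(), dtype=float)

# Graph Laplacian L = D - A of the network (unnormalized variant).
L = np.diag(A.sum(axis=1)) - A

# np.linalg.eigh returns eigenvalues in ascending order. Eigenvalue 0
# comes with a constant eigenvector and carries no information, so the
# most important new variable is the eigenvector at index 1
# (assuming the network is connected).
eigvals, eigvecs = np.linalg.eigh(L)
most_important_variable = eigvecs[:, 1]  # one value per data point
second_variable = eigvecs[:, 2]
```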