---
title: "Metrics"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Metrics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: bibliography.bibtex
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(rnndescent)
```
Many distance functions are implemented in `rnndescent`, and you can specify
them in any function that needs one via the `metric` parameter. Technically not
all of these are metrics, but let's let that slide. Typical choices are
`"euclidean"` or `"cosine"`, the latter being more common for document-based
data. For binary data, `"hamming"` or `"jaccard"` might be a good place to
start.
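For example, the same `metric` parameter is accepted by any of the nearest
neighbor functions in the package. Here is a minimal sketch using
`brute_force_knn` on the numeric columns of `iris` (just a convenient stand-in
for real data):
```{r metric example}
iris_data <- as.matrix(iris[, 1:4])

# the default metric is Euclidean; cosine is often a better fit for
# document-like data
iris_nn_euc <- brute_force_knn(iris_data, k = 4, metric = "euclidean")
iris_nn_cos <- brute_force_knn(iris_data, k = 4, metric = "cosine")

# the neighbor indices and distances are returned in the `idx` and `dist`
# components
head(iris_nn_cos$idx)
```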
The metrics here are a subset of those offered by the
[PyNNDescent](https://github.com/lmcinnes/pynndescent/tree/master) Python
package which in turn reproduces those in the
[scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance)
module of [SciPy](https://scipy.org/). Many of the binary distances seem to have
definitions shared with [@choi2010survey] so you may want to look in that
reference for an exact definition.
* `"braycurtis"`: [Bray-Curtis](https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity).
* `"canberra"`: [Canberra](https://en.wikipedia.org/wiki/Canberra_distance).
* `"chebyshev"`: [Chebyshev](https://en.wikipedia.org/wiki/Chebyshev_distance),
also known as the L-infinity norm ($L_\infty$).
* `"correlation"`: 1 minus the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
* `"cosine"`: 1 minus the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
* `"dice"`: the [Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient),
also known as the Sørensen–Dice coefficient. Intended for binary data.
* `"euclidean"`: the Euclidean distance, also known as the L2 norm.
* `"hamming"`: the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance).
Intended for binary data.
* `"hellinger"`: the [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance).
This is intended to be used with a probability distribution, so ensure that each
row of your input data contains non-negative values which sum to `1`.
* `"jaccard"`: the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index),
also known as the Tanimoto coefficient. Intended for binary data.
* `"jensenshannon"`: the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence).
Like `"hellinger"`, this is intended to be used with a probability distribution.
* `"kulsinski"`: the Kulsinski dissimilarity as defined in the Python package `scipy.spatial.distance.kulsinski` (this function is deprecated in scipy).
Intended for binary data.
* `"sqeuclidean"` (squared Euclidean)
* `"manhattan"`: the Manhattan distance, also known as the L1 norm or [Taxicab distance](https://en.wikipedia.org/wiki/Taxicab_geometry).
* `"rogerstanimoto"`: the [Rogers-Tanimoto coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Rogers%E2%80%93Tanimoto_coefficient).
* `"russellrao"`: the [Russell-Rao coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Russel%E2%80%93Rao_coefficient).
* `"sokalmichener"`. Intended for binary data.
* `"sokalsneath"`: the [Sokal-Sneath coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Sokal%E2%80%93Sneath_coefficient).
Intended for binary data.
* `"spearmanr"`: 1 minus the [Spearman rank correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
* `"symmetrickl"` symmetrized version of the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
The symmetrization is calculated as $D_{KL}(P||Q) + D_{KL}(Q||P)$.
* `"tsss"` the Triangle Area Similarity-Sector Area Similarity or TS-SS metric
as described in [@7474366]. Compared to results in PyNNDescent (as of version
0.5.11), distances are smaller by a factor of 2 in this package. This does not
affect the returned nearest neighbors, only the distances. Multiply them by 2
if you need to get closer to the PyNNDescent results.
* `"yule"` the Yule dissimilarity. Intended for binary data.
For non-sparse data, the following variants are available with preprocessing:
this trades extra memory for a potential speed-up during the distance
calculation. Expect some minor numerical differences compared to the
non-preprocessed versions:
* `"cosine-preprocess"`: `cosine` with preprocessing.
* `"correlation-preprocess"`: `correlation` with preprocessing.
## Specialized Binary Metrics
Some metrics are intended for use with binary data. This means that:
* Your numeric data should consist of only two distinct values, typically
`0` and `1`. You will get unpredictable results otherwise.
* If you provide the data as a `logical` matrix, a much faster implementation
is used.
The metrics you can use with binary data are:
* `"dice"`
* `"hamming"`
* `"jaccard"`
* `"kulsinski"`
* `"matching"`
* `"rogerstanimoto"`
* `"russellrao"`
* `"sokalmichener"`
* `"sokalsneath"`
* `"yule"`
Here's an example of using binary data stored as 0s and 1s with the `"hamming"`
metric:
```{r binary data}
set.seed(42)
binary_data <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
head(binary_data)
```
```{r hamming}
nn <- brute_force_knn(binary_data, k = 4, metric = "hamming")
```
Now let's convert it to a logical matrix:
```{r logical data}
logical_data <- binary_data == 1
head(logical_data)
```
```{r logical hamming}
logical_nn <- brute_force_knn(logical_data, k = 4, metric = "hamming")
```
The results will be the same:
```{r compare}
all.equal(nn, logical_nn)
```
On a real-world dataset, however, the logical version will be much faster.
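If you want to see the difference on your own data, a rough timing comparison
might look something like the following sketch (not evaluated here; the exact
numbers will depend on your machine and the size of the data):
```{r timing sketch, eval = FALSE}
# a larger random binary matrix so the difference is visible
big_binary <- matrix(sample(c(0, 1), 1e5, replace = TRUE), ncol = 100)
big_logical <- big_binary == 1

# numeric 0/1 input uses the generic distance implementation
system.time(brute_force_knn(big_binary, k = 15, metric = "hamming"))

# logical input uses the much faster specialized binary implementation
system.time(brute_force_knn(big_logical, k = 15, metric = "hamming"))
```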
## References