--- title: "Metrics" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Metrics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibliography.bibtex --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(rnndescent) ``` A lot of distance functions are implemented in `rnndescent`, which you can specify in every function which needs them with the `metric` parameter. Technically not all of these are metrics, but let's just let that slide. Typical are `"euclidean"` or `"cosine"` the latter being more common for document-based data. For binary data, `"hamming"` or `"jaccard"` might be a good place to start. The metrics here are a subset of those offered by the [PyNNDescent](https://github.com/lmcinnes/pynndescent/tree/master) Python package which in turn reproduces those in the [scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance) module of [SciPy](https://scipy.org/). Many of the binary distances seem to have definitions shared with [@choi2010survey] so you may want to look in that reference for an exact definition. * `"braycurtis"`: [Bray-Curtis](https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity). * `"canberra"`: [Canberra](https://en.wikipedia.org/wiki/Canberra_distance). * `"chebyshev"`: [Chebyshev](https://en.wikipedia.org/wiki/Chebyshev_distance), also known as the L-infinity norm ($L_\infty$). * `"correlation"`: 1 minus the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). * `"cosine"`: 1 minus the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). * `"dice"`: the [Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient), also known as the Sørensen–Dice coefficient. Intended for binary data. * `"euclidean"`: the Euclidean distance, also known as the L2 norm. * `"hamming"`: the [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance). Intended for binary data. * `"hellinger"`: the [Hellinger distance](https://en.wikipedia.org/wiki/Hellinger_distance). This is intended to be used with a probability distribution, so ensure that each row of your input data contains non-negative values which sum to `1`. * `"jaccard"`: the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index), also known as the Tanimoto coefficient. Intended for binary data. * `"jensenshannon"`: the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence). Like `"hellinger"`, this is intended to be used with a probability distribution. * `"kulsinski"`: the Kulsinski dissimilarity as defined in the Python package `scipy.spatial.distance.kulsinski` (this function is deprecated in scipy). Intended for binary data. * `"sqeuclidean"` (squared Euclidean) * `"manhattan"`: the Manhattan distance, also known as the L1 norm or [Taxicab distance](https://en.wikipedia.org/wiki/Taxicab_geometry). * `"rogerstanimoto"`: the [Rogers-Tanimoto coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Rogers%E2%80%93Tanimoto_coefficient). * `"russellrao"`: the [Russell-Rao coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Russel%E2%80%93Rao_coefficient). * `"sokalmichener"`. Intended for binary data. * `"sokalsneath"`: the [Sokal-Sneath coefficient](https://en.wikipedia.org/wiki/Qualitative_variation#Sokal%E2%80%93Sneath_coefficient). Intended for binary data. 
* `"spearmanr"`: 1 minus the [Spearman rank correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) * `"symmetrickl"` symmetrized version of the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence). The symmetrization is calculated as $D_{KL}(P||Q) + D_{KL}(Q||P)$. * `"tsss"` the Triangle Area Similarity-Sector Area Similarity or TS-SS metric as described in [@7474366]. Compared to results in PyNNDescent (as of version 0.5.11), distances are smaller by a factor of 2 in this package. This does not affect the returned nearest neighbors, only the distances. Multiply them by 2 if you need to get closer to the PyNNDescent results. * `"yule"` the Yule dissimilarity. Intended for binary data. For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions: * `"cosine-preprocess"`: `cosine` with preprocessing. * `"correlation-preprocess"`: `correlation` with preprocessing. ## Specialized Binary Metrics Some metrics are intended for use with binary data. This means that: * Your numeric data should consist of only two distinct values, typically `0` and `1`. You will get unpredictable results otherwise. * If you provide the data as a `logical` matrix, a much faster implementation is used. The metrics you can use with binary data are: * `"dice"` * `"hamming"` * `"jaccard"` * `"kulsinski"` * `"matching"` * `"rogerstanimoto"` * `"russellrao"` * `"sokalmichener"` * `"sokalsneath"` * `"yule"` Here's an example of using binary data stored as 0s and 1s with the `"hamming"` metric: ```{r binary data} set.seed(42) binary_data <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10) head(binary_data) ``` ```{r hamming} nn <- brute_force_knn(binary_data, k = 4, metric = "hamming") ``` Now let's convert it to a logical matrix: ```{r logical data} logical_data <- binary_data == 1 head(logical_data) ``` ```{r logical hamming} logical_nn <- brute_force_knn(logical_data, k = 4, metric = "hamming") ``` The results will be the same: ```{r compare} all.equal(nn, logical_nn) ``` but on a real-world dataset, the logical version will be much faster. ## References