A lot of distance functions are implemented in
rnndescent
, which you can specify in every function which
needs them with the metric
parameter. Technically not all
of these are metrics, but let’s just let that slide. Typical are
"euclidean"
or "cosine"
the latter being more
common for document-based data. For binary data, "hamming"
or "jaccard"
might be a good place to start.
The metrics here are a subset of those offered by the PyNNDescent Python package which in turn reproduces those in the scipy.spatial.distance module of SciPy. Many of the binary distances seem to have definitions shared with (Choi et al. 2010) so you may want to look in that reference for an exact definition.
"braycurtis"
: Bray-Curtis."canberra"
: Canberra."chebyshev"
: Chebyshev,
also known as the L-infinity norm (L∞)."correlation"
: 1 minus the Pearson
correlation."cosine"
: 1 minus the cosine
similarity."dice"
: the Dice
coefficient, also known as the Sørensen–Dice coefficient. Intended
for binary data."euclidean"
: the Euclidean distance, also known as the
L2 norm."hamming"
: the Hamming
distance. Intended for binary data."hellinger"
: the Hellinger
distance. This is intended to be used with a probability
distribution, so ensure that each row of your input data contains
non-negative values which sum to 1
."jaccard"
: the Jaccard index,
also known as the Tanimoto coefficient. Intended for binary data."jensenshannon"
: the Jensen-Shannon
divergence. Like "hellinger"
, this is intended to be
used with a probability distribution."kulsinski"
: the Kulsinski dissimilarity as defined in
the Python package scipy.spatial.distance.kulsinski
(this
function is deprecated in scipy). Intended for binary data."sqeuclidean"
(squared Euclidean)"manhattan"
: the Manhattan distance, also known as the
L1 norm or Taxicab
distance."rogerstanimoto"
: the Rogers-Tanimoto
coefficient."russellrao"
: the Russell-Rao
coefficient."sokalmichener"
. Intended for binary data."sokalsneath"
: the Sokal-Sneath
coefficient. Intended for binary data."spearmanr"
: 1 minus the Spearman
rank correlation"symmetrickl"
symmetrized version of the Kullback-Leibler
divergence. The symmetrization is calculated as DKL(P||Q) + DKL(Q||P)."tsss"
the Triangle Area Similarity-Sector Area
Similarity or TS-SS metric as described in (Heidarian and Dinneen 2016). Compared to
results in PyNNDescent (as of version 0.5.11), distances are smaller by
a factor of 2 in this package. This does not affect the returned nearest
neighbors, only the distances. Multiply them by 2 if you need to get
closer to the PyNNDescent results."yule"
the Yule dissimilarity. Intended for binary
data.For non-sparse data, the following variants are available with preprocessing: this trades memory for a potential speed up during the distance calculation. Some minor numerical differences should be expected compared to the non-preprocessed versions:
"cosine-preprocess"
: cosine
with
preprocessing."correlation-preprocess"
: correlation
with
preprocessing.Some metrics are intended for use with binary data. This means that:
0
and 1
. You will get unpredictable
results otherwise.logical
matrix, a much
faster implementation is used.The metrics you can use with binary data are:
"dice"
"hamming"
"jaccard"
"kulsinski"
"matching"
"rogerstanimoto"
"russellrao"
"sokalmichener"
"sokalsneath"
"yule"
Here’s an example of using binary data stored as 0s and 1s with the
"hamming"
metric:
set.seed(42)
binary_data <- matrix(sample(c(0, 1), 100, replace = TRUE), ncol = 10)
head(binary_data)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0 0 0 1 0 1 0 1 1 0
#> [2,] 0 1 0 1 0 0 1 0 0 1
#> [3,] 0 0 0 1 1 1 1 1 1 0
#> [4,] 0 1 0 1 1 1 1 1 0 1
#> [5,] 1 0 0 0 1 1 1 1 0 1
#> [6,] 1 0 1 1 1 1 1 1 1 1
Now let’s convert it to a logical matrix:
logical_data <- binary_data == 1
head(logical_data)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
#> [2,] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
#> [3,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
#> [4,] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
#> [5,] TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
#> [6,] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The results will be the same:
but on a real-world dataset, the logical version will be much faster.