---
title: "rnndescent"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{rnndescent}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: bibliography.bibtex
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
`rnndescent` is an R package for finding approximate nearest neighbors, based
heavily on the Python package [PyNNDescent](https://github.com/lmcinnes/pynndescent)
by [Leland McInnes](https://github.com/lmcinnes), but is a fully independent
reimplementation written in C++. It uses the following techniques:
1. Initialization by creating a forest of random project trees
[@dasgupta2008random].
2. Optimization by using nearest neighbor descent
[@dong2011efficient].
3. For building a search graph, graph diversification techniques
from FANNG [@harwood2016fanng].
4. For querying new data, the back-tracking search from NGT
[@iwasaki2018optimization] (without dynamic degree-adjustment).
The easiest way to find k-nearest neighbors and query new data is to use the
`rnnd_knn` function, which combine several of the available techniques into
sensible defaults use the `rnnd_build` and `rnnd_query` functions. For greater
flexibility, the underlying functions used by `rnnd_build` and `rnnd_query` can
be used directly. The other vignettes in this package describe their use and go
into more detail about the how the methods work.
```{r setup}
library(rnndescent)
```
## Find the k-nearest neighbors
If you just want the k-nearest neighbors of some data, use `rnnd_knn`:
```{r build knn}
iris_knn <- rnnd_knn(data = iris, k = 5)
```
### The Neighbor Graph Format
The nearest neighbor graph format returned by all functions in this package is
a list of two matrices:
* `idx` -- a matrix of indices of the nearest neighbors. As usual in R, these
are 1-indexed.
* `dist` -- the equivalent distances.
```{r neighbor graph}
lapply(iris_knn, function(x) {
head(x, 3)
})
```
## Build an Index
`rnnd_knn` returns the k-nearest neighbors, but does not return any "index" that
you can use to query new data. To do that, use `rnnd_build`. Normally you would
query the index with different from that which you used to build the index, so
let's split `iris` up:
```{r iris split}
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
```
```{r build index}
iris_index <- rnnd_build(iris_even, k = 5)
```
The index is also a list but with a lot more components (none of which are
intended for manual examination), apart from the the neighbor graph which can
be found under the `graph` component in the same format as the return value of
`rnnd_knn`:
```{r index neighbor graph}
lapply(iris_index$graph, function(x) {
head(x, 3)
})
```
Be aware that for large and high-dimensional data, the returned index can get
**very** large, especially if you set `n_search_trees` to a large value.
## Querying Data
To query new data, use `rnnd_query`:
```{r query index}
iris_odd_nn <- rnnd_query(
index = iris_index,
query = iris_odd,
k = 5
)
lapply(iris_odd_nn, function(x) {
head(x, 3)
})
```
You don't need to keep the data that was used to build the index around, because
internally, the index stores that (that's one of the reasons the index can get
large).
Another use for `rnnd_query` is to improve the quality of a k-nearest neighbor
graph. We are using for a `query` the same data we used to build `iris_index`
and specifying via the `init` parameter the knn graph we already generated:
```{r query knn data}
iris_knn_improved <- rnnd_query(
index = iris_index,
query = iris_even,
init = iris_index$graph,
k = 5
)
```
If the k-nearest neighbor graph in `index$graph` isn't sufficiently high
quality, then result of running `rnnd_query` on the same data should be an
improvement. Exactly how much better is hard to say, but you can always compare
the sum of the distances:
```{r sum distances}
c(
sum(iris_index$graph$dist),
sum(iris_knn_improved$dist)
)
```
In this case, the initial knn has not been improved, which is hardly surprising
due to the size of the dataset. Another function that might be of use is the
`neighbor_overlap` function to see how many neighbors are shared
between the two graphs:
```{r neighbor overlap}
neighbor_overlap(iris_index$graph, iris_knn_improved)
```
As there was no change to the graph, the overlap is 100%. More details on this
can be found in the [hubness](hubness.html) vignette and a more ambitious
dataset is covered in the [FMNIST article](https://jlmelville.github.io/rnndescent/articles/fmnist-example.html).
## Parallelism
`rnndescent` is multi-threaded, but by default is single-threaded. Set
`n_threads` to set the number of threads you want to use:
```{r parallel}
iris_index <- rnnd_build(data = iris_even, k = 5, n_threads = 2)
```
## Available Metrics
Several different distances are available in `rnndescent` beyond the
typically-supported Euclidean and Cosine-based distances in other nearest
neighbor packages. See the [metrics](metrics.html) vignette for more details.
## Supported Data Types
* Dense matrices and data frames.
* Sparse matrices, in the `dgCMatrix`. All the same distances are supported as
for dense matrices.
* Additionally, for dense binary data, if you supply it as a `logical` matrix,
then for certain distances intended for binary data, specialized functions will
be used to speed up the computation.
## Parameters
There are several options that `rnnd_build` and `rnnd_query` expose that can
be modified to change the behavior of the different stages of the algorithm.
See the documentation for those functions (e.g. `?rnnd_build`) or the
[Random Partition Forests](random-partition-forests.html), [Nearest Neighbor Descent](nearest-neighbor-descent.html) and [Querying Data](querying-data.html)
vignettes for more details.
## References