Querying Data

library(rnndescent)

The usual pattern of usage for approximate nearest neighbors methods is:

  1. Build some kind of index with input data.
  2. Query that index with new data to find the nearest neighbors of your query.

This is how e.g. Annoy and hnswlib work. If you just want the k-nearest neighbors of the data you used to build the index in step 1, you can pass that same data as the query in step 2, but rnndescent provides some specialized functions for this case that are slightly more efficient: see for example nnd_knn and rpf_knn. Nonetheless, querying the index with the original data can produce a more accurate result; see the hubness vignette for an example of that.

Below we will see some of the options that rnndescent has for querying an index.

For convenience, I will use all the even rows of the iris data to build an index, and search using the odd rows:

iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]

Brute Force

If your dataset is small enough, you can just use brute force to find the neighbors. No index to build, no worry about how approximate the results are:

brute_nbrs <- brute_force_knn_query(
  query = iris_odd,
  reference = iris_even,
  k = 15
)

The format of brute_nbrs is the usual k-nearest neighbors graph format: a list of two matrices, both of dimension (nrow(iris_odd), k). The first matrix, idx, contains the indices of the nearest neighbors, and the second matrix, dist, contains the distances to those neighbors (here I’ll just show the first five results per row):

lapply(brute_nbrs, function(m) {
  head(m[, 1:5])
})
#> $idx
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    9   20   14    4   25
#> [2,]   24    2   23   15    1
#> [3,]   19    9    4   20   14
#> [4,]   24    6   15    2   19
#> [5,]    2    7   24   23   15
#> [6,]   14   10    3   16   11
#> 
#> $dist
#>           [,1]      [,2]      [,3]      [,4]      [,5]
#> [1,] 0.1000000 0.1414213 0.1414213 0.1732050 0.2236068
#> [2,] 0.1414213 0.2449490 0.2645753 0.3000001 0.3000002
#> [3,] 0.1414213 0.1732050 0.2236066 0.2449488 0.2449488
#> [4,] 0.2236068 0.3000002 0.3162278 0.3316627 0.4123106
#> [5,] 0.2999998 0.3464101 0.3605550 0.4242641 0.4690414
#> [6,] 0.2828429 0.3316626 0.3464102 0.3605551 0.3605553

Random Projection Forests

If you build a random projection forest with rpf_build, you can query it with rpf_knn_query:

rpf_index <- rpf_build(iris_even)
rpf_nbrs <- rpf_knn_query(
  query = iris_odd,
  reference = iris_even,
  forest = rpf_index,
  k = 15
)

See the Random Projection Forests vignette for more.

Preparing the Search Graph

In all the examples so far, we have used the k-nearest neighbors graph as the reference_graph input to graph_knn_query. Is this actually a good idea? Probably not! There is no guarantee that all the items in the original dataset can actually be reached via the k-nearest neighbors graph. Some nodes just aren’t very popular and may not be in the neighbor list of any other item. That means you can never reach them via the k-nearest neighbors graph, no matter how thoroughly you search it.
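
To make the unreachability problem concrete, here is a small base-R sketch (toy data, not the iris example) that finds items appearing in no one’s neighbor list:

```r
# A toy k-NN index matrix for 5 items: row i lists the neighbors of item i.
# Items 4 and 5 never appear in any neighbor list, so a search that only
# follows k-NN edges can never reach them, no matter where it starts.
idx <- matrix(c(
  2, 3,
  1, 3,
  1, 2,
  1, 2,
  1, 2
), ncol = 2, byrow = TRUE)

unreachable <- setdiff(seq_len(nrow(idx)), unique(as.vector(idx)))
unreachable
#> [1] 4 5
```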

We can solve this problem by reversing all the edges in the graph and adding them to it: if you could get to item i from item j, you can now also get to item j from item i. This solves one problem but creates another: just as some items are very unpopular, others might be very popular and appear in the neighbor lists of many other items. Having a large number of these edges in the graph can make the search very slow, so we need to prune some of them.
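
The reversal step can be sketched with a sparse adjacency matrix from the Matrix package (a toy illustration, not how prepare_search_graph is implemented internally): reversing every edge is just a transpose.

```r
library(Matrix)

# Toy directed graph: A[i, j] != 0 means there is an edge i -> j, with the
# distance stored as the value.
A <- sparseMatrix(
  i = c(1, 2, 3),
  j = c(2, 3, 1),
  x = c(0.5, 0.2, 0.9),
  dims = c(3, 3)
)

# Reversing every edge is a transpose; an edge exists in the combined graph
# if it appears in either direction.
combined <- (A != 0) | (t(A) != 0)
sum(combined) # 6 edges: each original edge plus its reverse
```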

prepare_search_graph is a function that will take a k-nearest neighbor graph and add edges to it to make it more useful for a search. The procedure is based on the process described in (Harwood and Drummond 2016) and consists of:

  1. Reversing all the edges in the graph.
  2. “Diversifying” the graph by “occlusion pruning”. This considers triplets of points and removes long edges that are probably redundant: for an item i with neighbors p and q (with p the closer of the two), if d(p, q) < d(i, q), i.e. the two neighbors are closer to each other than q is to i, then p is said to “occlude” q and we don’t need both edges i → p and i → q – it’s likely that q is in the neighbor list of p or vice versa, so it’s unlikely that we are doing any harm by getting rid of i → q.
  3. After occlusion pruning, if any item still has an excessive number of edges, the longest edges are removed until the number of edges is below the threshold.
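
The occlusion-pruning rule in step 2 can be sketched for a single item in base R. This is a hypothetical illustration (occlusion_prune is my own helper, not the rnndescent implementation): walk the neighbors of i in order of increasing distance, and drop any later neighbor q that is closer to an already-kept neighbor than it is to i.

```r
# Hypothetical sketch of occlusion pruning for one item `i` (not the actual
# rnndescent implementation). `nbrs` holds neighbor indices sorted by
# increasing distance from i; `d` is a full distance matrix.
occlusion_prune <- function(i, nbrs, d, prob = 1.0) {
  kept <- c()
  for (q in nbrs) {
    # q is occluded if some already-kept neighbor p is closer to q than i is
    occluded <- any(d[kept, q] < d[i, q])
    if (occluded && runif(1) < prob) next # drop the edge i -> q
    kept <- c(kept, q)
  }
  kept
}

# Three collinear points: 2 lies between 1 and 3, so from i = 1 the edge
# 1 -> 3 is occluded by the kept neighbor 2, because d(2, 3) < d(1, 3).
d <- as.matrix(dist(c(0, 1, 2)))
occlusion_prune(1, nbrs = c(2, 3), d = d)
#> [1] 2
```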

To control all this pruning the following parameters are available:

Diversification Probability

diversify_prob is the probability of a neighbor being removed if it is found to be an “occlusion”. This should take a value between 0 (no diversification) and 1 (remove as many edges as possible). The default is 1.0.

The DiskAnn/Vamana method’s pruning algorithm is almost identical but instead of a probability, uses a related parameter called alpha, which acts in the opposite direction: increasing alpha increases the density of the graph. Why am I telling you this? The pbbsbench implementation of PyNNDescent uses alpha instead of diversify_prob and in the accompanying paper (Dobson et al. 2023) they mention that the use of alpha yields “modest improvements” – from context this seems to mean relative to using diversify_prob = 1.0. I can’t give an exact mapping between the two values unfortunately.

Degree Pruning

pruning_degree_multiplier controls how many edges each item may keep after the occlusion pruning, relative to the number of neighbors in the original nearest neighbor graph. The default is 1.5, which allows up to 50% more edges than the original graph. So if the input graph was built with k = 15, each item in the search graph will have at most floor(15 * 1.5) = 22 edges.

Let’s see how this works on the iris neighbors:

set.seed(42)
iris_search_graph <- prepare_search_graph(
  data = iris_even,
  graph = rpf_nbrs,
  diversify_prob = 0.1,
  pruning_degree_multiplier = 1.5
)

Because the returned search graph can contain a different number of edges per item, the neighbor graph format isn’t suitable. Instead you get back a sparse matrix, specifically a dgCMatrix. Here’s a histogram of how the edges are distributed:

search_graph_edges <- diff(iris_search_graph@p)
hist(search_graph_edges,
  main = "Distribution of search graph edges", xlab = "# edges"
)

range(search_graph_edges)
#> [1]  7 22

So most items have around k = 15 edges, just like the nearest neighbor graph, but some have the maximum of 22 edges and a few have as few as 7.

search_nbrs <- graph_knn_query(
  query = iris_odd,
  reference = iris_even,
  reference_graph = iris_search_graph,
  init = rpf_index,
  k = 15
)
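
One way to judge how well the approximate search did is to compute recall against the brute-force result: the fraction of each row’s true neighbors that the search actually found. The helper below is hypothetical (not part of rnndescent); with the real results above you would call it as knn_recall(search_nbrs$idx, brute_nbrs$idx).

```r
# Hypothetical helper (not part of rnndescent): mean per-row overlap between
# an approximate index matrix and the true (brute-force) index matrix.
knn_recall <- function(approx_idx, true_idx) {
  per_row <- vapply(seq_len(nrow(true_idx)), function(i) {
    length(intersect(approx_idx[i, ], true_idx[i, ])) / ncol(true_idx)
  }, numeric(1))
  mean(per_row)
}

# Toy check: row 1 agrees on 2 of 3 neighbors, row 2 on all 3.
approx <- rbind(c(1, 2, 4), c(5, 6, 7))
truth <- rbind(c(1, 2, 3), c(7, 6, 5))
knn_recall(approx, truth)
#> [1] 0.8333333
```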

References

Dobson, Magdalen, Zheqi Shen, Guy E Blelloch, Laxman Dhulipala, Yan Gu, Harsha Vardhan Simhadri, and Yihan Sun. 2023. “Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis.” arXiv Preprint arXiv:2305.04359.
Harwood, Ben, and Tom Drummond. 2016. “Fanng: Fast Approximate Nearest Neighbour Graphs.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5713–22.
Iwasaki, Masajiro, and Daisuke Miyazaki. 2018. “Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data.” arXiv Preprint arXiv:1810.07355.
Wang, Mengzhao, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. “A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search.” arXiv Preprint arXiv:2101.12631.