Title: Nearest Neighbor Descent Method for Approximate Nearest Neighbors
Description: The Nearest Neighbor Descent method for finding approximate nearest neighbors by Dong and co-workers (2011) <doi:10.1145/1963405.1963487>. Based on the 'Python' package 'PyNNDescent' <https://github.com/lmcinnes/pynndescent>.
Authors: James Melville [aut, cre, cph], Vitalie Spinu [ctb], Ralf Stubner [ctb]
Maintainer: James Melville <[email protected]>
License: GPL (>= 3)
Version: 0.1.6.9000
Source: https://github.com/jlmelville/rnndescent
Returns the exact nearest neighbors of a dataset. A brute force search is carried out: all possible pairs of points are compared, and the nearest neighbors are returned.
brute_force_knn(
  data,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
This method is accurate but scales poorly with dataset size, so use with caution on larger datasets. Having the exact neighbors as a ground truth to compare with approximate results is useful for benchmarking and determining parameter settings of the approximate methods.
the nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
# Find the 4 nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean")

# Manhattan (l1) distance
iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan")

# Multi-threading: you can choose the number of threads to use: in real
# usage, you will want to set n_threads to at least 2
iris_nn <- brute_force_knn(iris, k = 4, metric = "manhattan", n_threads = 1)

# Use verbose flag to see information about progress
iris_nn <- brute_force_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
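Because the exact results serve as a ground truth, one natural benchmarking workflow is to measure how many exact neighbors an approximate method recovers. A minimal sketch, assuming the default nnd_knn() settings are adequate for the data; neighbor_overlap() (documented below) reports the mean fraction of shared neighbors:

# Exact neighbors as the ground truth
iris_exact <- brute_force_knn(iris, k = 15)
# An approximate method to evaluate
iris_approx <- nnd_knn(iris, k = 15)
# Mean fraction of neighbors in common: 1 means perfect recall
neighbor_overlap(iris_exact, iris_approx)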
Returns the exact nearest neighbors of query data to the reference data. A brute force search is carried out: all possible pairs of reference and query points are compared, and the nearest neighbors are returned.
brute_force_knn_query(
  query,
  reference,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
query: Matrix of n query items, with observations in the rows and features in the columns.
reference: Matrix of m reference items, with observations in the rows and features in the columns. The nearest neighbors of the query items are searched for in this data.
k: Number of nearest neighbors to return.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
This is accurate but scales poorly with dataset size, so use with caution on larger datasets. Having the exact neighbors as a ground truth to compare with approximate results is useful for benchmarking and determining parameter settings of the approximate methods.
the nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices in reference.
dist: an n by k matrix containing the nearest neighbor distances to the items in reference.
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]

# For each item in iris_query find the 4 nearest neighbors in iris_ref
# If you pass a data frame, non-numeric columns are removed
# set verbose = TRUE to get details on the progress being made
iris_query_nn <- brute_force_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "euclidean", verbose = TRUE
)

# Manhattan (l1) distance
iris_query_nn <- brute_force_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "manhattan"
)
Run queries against a search graph, to return nearest neighbors taken from the reference data used to build that graph.
graph_knn_query(
  query,
  reference,
  reference_graph,
  k = NULL,
  metric = "euclidean",
  init = NULL,
  epsilon = 0.1,
  max_search_fraction = 1,
  use_alt_metric = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
query: Matrix of n query items, with observations in the rows and features in the columns.
reference: Matrix of m reference items, with observations in the rows and features in the columns. Used to build the reference_graph.
reference_graph: Search graph of the reference data, e.g. as created by prepare_search_graph().
k: Number of nearest neighbors to return. Optional if init is specified.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
init: Initial neighbor data to start the search from: either nearest neighbor data in the usual list format or a random projection forest (see the examples). If not provided, the search is initialized randomly.
epsilon: Controls trade-off between accuracy and search cost, as described by Iwasaki and Miyazaki (2018), by specifying a distance tolerance on whether to explore the neighbors of candidate points. The larger the value, the more neighbors will be searched. A value of 0.1 allows query-candidate distances to be 10% larger than the current most-distant neighbor of the query point, 0.2 means 20%, and so on. Suggested values are between 0 and 0.5, although this value is highly dependent on the distribution of distances in the dataset (higher-dimensional data should choose a smaller cutoff). Too large a value of epsilon can make the search cost approach that of a brute force search.
max_search_fraction: Maximum fraction of the reference data to search. This is a value between 0 (search none of the reference data) and 1 (search all of the data if necessary). This works in conjunction with epsilon: the search stops as soon as either criterion is met.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
A greedy beam search is used to query the graph, combining two search pruning strategies. The first, due to Iwasaki and Miyazaki (2018), only considers new candidates within a relative distance of the current furthest neighbor in the query's graph. The second, due to Harwood and Drummond (2016), puts a limit on the absolute number of distance calculations to carry out. See the epsilon and max_search_fraction parameters respectively.
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices, specifying the row of the neighbor in reference.
dist: an n by k matrix containing the nearest neighbor distances.
Harwood, B., & Drummond, T. (2016). FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).
Iwasaki, M., & Miyazaki, D. (2018). Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355.
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]

# First, find the approximate 4-nearest neighbor graph for the references:
iris_ref_graph <- nnd_knn(iris_ref, k = 4)

# For each item in iris_query find the 4 nearest neighbors in iris_ref.
# You need to pass both the reference data and the reference graph.
# If you pass a data frame, non-numeric columns are removed.
# set verbose = TRUE to get details on the progress being made
iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_graph,
  k = 4, metric = "euclidean", verbose = TRUE
)

# A more complete example, converting the initial knn into a search graph
# and using a filtered random projection forest to initialize the search
# create initial knn and forest
iris_ref_graph <- nnd_knn(iris_ref, k = 4, init = "tree", ret_forest = TRUE)
# keep the best tree in the forest
forest <- rpf_filter(iris_ref_graph, n_trees = 1)
# expand the knn into a search graph
iris_ref_search_graph <- prepare_search_graph(iris_ref, iris_ref_graph)
# run the query with the improved graph and initialization
iris_query_nn <- graph_knn_query(iris_query, iris_ref, iris_ref_search_graph,
  init = forest, k = 4
)
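The search effort can also be tuned per run via the two pruning parameters described above. A small sketch continuing from the example above (the parameter values are illustrative, not recommendations):

# Greedy search: epsilon = 0 explores the fewest candidates
iris_query_nn_fast <- graph_knn_query(iris_query, iris_ref,
  iris_ref_search_graph,
  init = forest, k = 4, epsilon = 0
)
# A wider beam: more accurate, but more distance calculations
iris_query_nn_wide <- graph_knn_query(iris_query, iris_ref,
  iris_ref_search_graph,
  init = forest, k = 4, epsilon = 0.5
)
# Bound the total work regardless of epsilon
iris_query_nn_capped <- graph_knn_query(iris_query, iris_ref,
  iris_ref_search_graph,
  init = forest, k = 4, epsilon = 0.5, max_search_fraction = 0.1
)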
k_occur returns a vector of the k-occurrences of a nearest neighbor graph as defined by Radovanovic and co-workers (2010). The k-occurrence of an object is the number of times it occurs among the k-nearest neighbors of objects in a dataset.
k_occur(idx, k = NULL, include_self = TRUE)
idx: integer matrix containing the nearest neighbor indices, with integer labels starting at 1. Note that the integer labels do not have to refer to the rows of idx itself: for example, they may refer to the rows of a separate reference dataset.
k: The number of closest neighbors to use. Must be between 1 and the number of columns in idx. By default, all columns of idx are used.
include_self: logical indicating whether an item's own label, where it appears in its neighbor list, counts towards its k-occurrence (default TRUE).
The k-occurrence can take values between 0 and the size of the dataset. The larger the k-occurrence for an object, the more "popular" it is. Very large values of the k-occurrence (much larger than k) indicate that an object is a "hub" and also imply the existence of "anti-hubs": objects that never appear as k-nearest neighbors of other objects.

The presence of hubs can reduce the accuracy of nearest-neighbor descent and other approximate nearest neighbor algorithms in terms of retrieving the exact k-nearest neighbors. However, the appearance of hubs can still be detected in these approximate results, so calculating the k-occurrences for the output of nearest neighbor descent is a useful diagnostic step.
a vector of length max(idx), containing the number of times an object in idx was found in the nearest neighbor list of the objects represented by the row indices of idx.
Radovanovic, M., Nanopoulos, A., & Ivanovic, M. (2010). Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11, 2487-2531. https://www.jmlr.org/papers/v11/radovanovic10a.html
Bratic, B., Houle, M. E., Kurbalija, V., Oria, V., & Radovanovic, M. (2019). The Influence of Hubness on NN-Descent. International Journal on Artificial Intelligence Tools, 28(06), 1960002. doi:10.1142/S0218213019600029
iris_nbrs <- brute_force_knn(iris, k = 15)
iris_ko <- k_occur(iris_nbrs$idx)
# items 42 and 107 are not in the 15 nearest neighbors of any other members
# of iris
which(iris_ko == 1) # they are only their own nearest neighbor
max(iris_ko) # most "popular" item appears on 29 15-nearest neighbor lists
which(iris_ko == max(iris_ko)) # it's iris item 64
# with k = 15, a maximum k-occurrence = 29 ~= 1.9 * k, which is not a cause
# for concern
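Per the diagnostic advice in the Details above, the same check can be applied to the output of an approximate method. A short sketch (the nearest neighbor descent settings here are defaults, not recommendations):

iris_ann <- nnd_knn(iris, k = 15)
iris_ann_ko <- k_occur(iris_ann$idx)
# a maximum k-occurrence much larger than k would suggest hubness may be
# affecting the descent results
max(iris_ann_ko) / 15
summary(iris_ann_ko)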
merge_knn takes a list of nearest neighbor graphs and merges them into a single graph, with the same number of neighbors as the first graph. This is useful for combining the results of multiple different nearest neighbor searches: the output will be at least as accurate as the most accurate of the input graphs, and ideally will be more accurate than any of them.
merge_knn(graphs, is_query = FALSE, n_threads = 0, verbose = FALSE)
graphs: A list of nearest neighbor graphs to merge. Each item in the list should consist of a sub-list containing: idx, an n by k matrix of nearest neighbor indices, and dist, an n by k matrix of nearest neighbor distances.
is_query: If TRUE, the graphs are treated as the results of queries against a separate reference dataset, i.e. the neighbor indices refer to reference items rather than to the rows of the graph itself.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
a list containing:
idx: an n by k matrix containing the merged nearest neighbor indices.
dist: an n by k matrix containing the merged nearest neighbor distances.
The size of k in the output graph is the same as that of the first item in graphs.
set.seed(1337)
# Nearest neighbor descent with 15 neighbors for iris three times,
# starting from a different random initialization each time
iris_rnn1 <- nnd_knn(iris, k = 15, n_iters = 1)
iris_rnn2 <- nnd_knn(iris, k = 15, n_iters = 1)
iris_rnn3 <- nnd_knn(iris, k = 15, n_iters = 1)

# Merged results should be an improvement over individual results
iris_mnn <- merge_knn(list(iris_rnn1, iris_rnn2, iris_rnn3))
sum(iris_mnn$dist) < sum(iris_rnn1$dist)
sum(iris_mnn$dist) < sum(iris_rnn2$dist)
sum(iris_mnn$dist) < sum(iris_rnn3$dist)
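Query results can be merged in the same way. A sketch, assuming both graphs were produced by querying the same reference data (is_query = TRUE indicates that the neighbor indices refer to reference items rather than to the graph's own rows):

iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
iris_query <- iris[iris$Species == "versicolor", ]
iris_q1 <- random_knn_query(iris_query, reference = iris_ref, k = 15)
iris_q2 <- random_knn_query(iris_query, reference = iris_ref, k = 15)
# the merged result keeps the better neighbor found by either search
iris_qm <- merge_knn(list(iris_q1, iris_q2), is_query = TRUE)
sum(iris_qm$dist) <= sum(iris_q1$dist)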
Calculates the mean number of neighbors in common between the two graphs. The per-item overlap can also be returned. This function can be useful as a measure of accuracy of approximation algorithms, if the exact nearest neighbors are known, or as a measure of diversity of two different approximate graphs.
neighbor_overlap(idx1, idx2, k = NULL, ret_vec = FALSE)
idx1: Indices of a nearest neighbor graph, i.e. a matrix of nearest neighbor indices. Can also be a list containing an idx element, such as the graphs returned by functions in this package.
idx2: Indices of a nearest neighbor graph, i.e. a matrix of nearest neighbor indices. Can also be a list containing an idx element, such as the graphs returned by functions in this package.
k: Number of neighbors to consider. If NULL, the smaller of the number of columns in idx1 and idx2 is used.
ret_vec: If TRUE, also return a vector containing the per-item overlap.
The graph format is the same as that returned by e.g. nnd_knn() and should be of dimensions n by k, where n is the number of points and k is the number of neighbors. If you pass a neighbor graph directly, the index matrix will be extracted if present. If the two graphs have different numbers of neighbors, then the smaller number of neighbors is used.
The mean overlap between idx1 and idx2. If ret_vec = TRUE, a list containing the mean overlap and the overlap of each item in idx1 is returned, with names mean and overlaps, respectively.
set.seed(1337)
# Generate two random neighbor graphs for iris
iris_rnn1 <- random_knn(iris, k = 15)
iris_rnn2 <- random_knn(iris, k = 15)

# Overlap between the two graphs
mean_overlap <- neighbor_overlap(iris_rnn1, iris_rnn2)

# Also get a vector of per-item overlap
overlap_res <- neighbor_overlap(iris_rnn1, iris_rnn2, ret_vec = TRUE)
summary(overlap_res$overlaps)
Uses the Nearest Neighbor Descent method due to Dong and co-workers (2011) to optimize an approximate nearest neighbor graph.
nnd_knn(
  data,
  k = NULL,
  metric = "euclidean",
  init = "rand",
  init_args = NULL,
  n_iters = NULL,
  max_candidates = NULL,
  delta = 0.001,
  low_memory = TRUE,
  weight_by_degree = FALSE,
  use_alt_metric = TRUE,
  n_threads = 0,
  verbose = FALSE,
  progress = "bar",
  obs = "R",
  ret_forest = FALSE
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return. Optional if init is specified as a neighbor graph.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
init: Name of the initialization strategy, or initial neighbor data to optimize. "rand" (the default) uses random initialization; "tree" uses a forest of random projection trees (Dasgupta and Freund, 2008). Alternatively, pass a previously calculated neighbor graph in the usual list format; in that case, k is optional.
init_args: a list containing arguments to pass to the random partition forest initialization; see rpf_knn() for the available parameters. Only used if init = "tree".
n_iters: Number of iterations of nearest neighbor descent to carry out. By default, this will be chosen based on the number of observations in data.
max_candidates: Maximum number of candidate neighbors to try for each item in each iteration. Use relative to k to emulate the "rho" sampling parameter of the nearest neighbor descent paper.
delta: The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. The default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.
low_memory: If TRUE (the default), use a slower but more memory-frugal approach. Setting this to FALSE avoids repeated distance calculations at the cost of more RAM and can give a noticeable speed improvement.
weight_by_degree: If TRUE, weight candidates in the local join by their in-degree, which can improve accuracy when hub nodes are present.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
progress: Determines the type of progress information logged if verbose = TRUE. The default "bar" shows a simple text progress bar; "dist" logs the sum of distances in the current graph at each iteration.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
ret_forest: If TRUE and init = "tree", also return the RP forest used to initialize the neighbor data (see the Value section).
If no initial graph is provided, a random graph is generated; alternatively, the graph can be initialized from a forest of random projection trees, using the method of Dasgupta and Freund (2008).
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
forest: (if init = "tree" and ret_forest = TRUE only) the RP forest used to initialize the neighbor data.
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.
# Find 4 (approximate) nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean")

# Manhattan (l1) distance
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan")

# Multi-threading: you can choose the number of threads to use: in real
# usage, you will want to set n_threads to at least 2
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan", n_threads = 1)

# Use verbose flag to see information about progress
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)

# Nearest neighbor descent uses random initialization, but you can pass any
# approximation using the init argument (as long as the metrics used to
# calculate the initialization are compatible with the metric options used
# by nnd_knn).
iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)

# Number of iterations controls how much optimization is attempted. A
# smaller value will run faster but give poorer results
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 2)

# You can also control the amount of work done within an iteration by
# setting max_candidates
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", max_candidates = 50)

# Optimization may also stop early if not much progress is being made. This
# convergence criterion can be controlled via delta. A larger value will
# stop progress earlier. The verbose flag will provide some information if
# convergence is occurring before all iterations are carried out.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0.5)

# To ensure that descent only stops if no improvements are made, set
# delta = 0
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0)

# A faster version of the algorithm is available that avoids repeated
# distance calculations at the cost of using more RAM. Set low_memory to
# FALSE to try it.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", low_memory = FALSE)

# Using init = "tree" is usually more efficient than random initialization.
# Arguments to the tree initialization method can be passed via the
# init_args list
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, init = "tree", init_args = list(n_trees = 5))
Create a graph from existing nearest neighbor data, balancing search speed and accuracy via the occlusion pruning and truncation strategies of Harwood and Drummond (2016). The resulting search graph should be more efficient for querying new data than the original nearest neighbor graph.
prepare_search_graph(
  data,
  graph,
  metric = "euclidean",
  use_alt_metric = TRUE,
  diversify_prob = 1,
  pruning_degree_multiplier = 1.5,
  prune_reverse = FALSE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
data: Matrix of n items, with observations in the rows and features in the columns.
graph: neighbor graph for data, in the usual list format (e.g. the output of nnd_knn()).
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
diversify_prob: the degree of diversification of the search graph by removing unnecessary edges through occlusion pruning. This should take a value between 0 (no diversification) and 1 (remove as many occluded edges as possible, the default).
pruning_degree_multiplier: How strongly to truncate the final neighbor list for each item. The neighbor list of each item will be truncated to retain only the closest pruning_degree_multiplier * k neighbors.
prune_reverse: If TRUE, also apply occlusion pruning to the reverse graph before merging (default FALSE).
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
An approximate nearest neighbor graph is not very useful for querying via graph_knn_query(), especially if the query data is initialized randomly: some items in the data set may not be in the nearest neighbor list of any other item and can therefore never be returned as a neighbor, no matter how close they are to the query. Even those which do appear in at least one neighbor list may not be reachable by expanding an arbitrary starting list if the neighbor graph contains disconnected components.
Converting the directed graph represented by the neighbor graph to an undirected graph, by adding an edge from item j to i if an edge exists from i to j (i.e. creating the mutual neighbor graph), solves the problems above, but can result in inefficient searches. Although the out-degree of each item is restricted to the number of neighbors, the in-degree has no such restriction: a given item could be very "popular" and appear in a large number of neighbor lists. Therefore mutualizing the neighbor graph can leave some items with a large number of neighbors to search. These usually have very similar neighborhoods, so there is little to be gained from searching all of them.
To balance accuracy and search time, the following procedure is carried out:

1. The graph is "diversified" by occlusion pruning.
2. The reverse graph is formed by reversing the direction of all edges in the pruned graph.
3. The reverse graph is diversified by occlusion pruning.
4. The pruned forward and pruned reverse graphs are merged.
5. The out-degree of each node in the merged graph is truncated.
6. The truncated merged graph is returned as the prepared search graph.
Explicit zero distances in the graph will be converted to a small positive number to avoid being dropped in the sparse representation. The one exception is the "self" distance, i.e. any edge in the graph which links a node to itself (the diagonal of the sparse distance matrix). These trivial edges aren't useful for search purposes and are always dropped.
a search graph for data based on graph, represented as a sparse matrix, suitable for use with graph_knn_query().
Harwood, B., & Drummond, T. (2016). FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).
Jayaram Subramanya, S., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019). DiskANN: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems, 32.
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]

# First, find the approximate 4-nearest neighbor graph for the references:
ref_ann_graph <- nnd_knn(iris_ref, k = 4)

# Create a graph for querying with
ref_search_graph <- prepare_search_graph(iris_ref, ref_ann_graph)

# Using the search graph rather than the ref_ann_graph directly may give
# more accurate or faster results
iris_query_nn <- graph_knn_query(
  query = iris_query, reference = iris_ref,
  reference_graph = ref_search_graph, k = 4, metric = "euclidean",
  verbose = TRUE
)
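The effect of the pruning parameters can be checked by counting the edges in the returned sparse matrix. A sketch reusing the objects from the example above (nnzero() comes from the Matrix package, which is assumed to be installed; parameter values are illustrative):

# More aggressive pruning: fewer edges, faster but potentially less
# accurate queries
sparser_graph <- prepare_search_graph(iris_ref, ref_ann_graph,
  pruning_degree_multiplier = 1
)
# Keep more edges: slower but potentially more accurate queries
denser_graph <- prepare_search_graph(iris_ref, ref_ann_graph,
  diversify_prob = 0.5, pruning_degree_multiplier = 3
)
Matrix::nnzero(sparser_graph)
Matrix::nnzero(denser_graph)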
Create a neighbor graph by randomly selecting neighbors. This is not a useful nearest neighbor method on its own, but can be used with other methods which require initialization, such as nnd_knn().
random_knn(
  data,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  order_by_distance = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
order_by_distance: If TRUE (the default), the returned neighbors for each item are sorted by increasing distance.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
a random neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
# Find 4 random neighbors and calculate their Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- random_knn(iris, k = 4, metric = "euclidean")

# Manhattan (l1) distance
iris_nn <- random_knn(iris, k = 4, metric = "manhattan")

# Multi-threading: you can choose the number of threads to use: in real
# usage, you will want to set n_threads to at least 2
iris_nn <- random_knn(iris, k = 4, metric = "manhattan", n_threads = 1)

# Use verbose flag to see information about progress
iris_nn <- random_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)

# These results can be improved by nearest neighbors descent. You don't need
# to specify k here because this is worked out from the initial input
iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
Run queries against reference data to return randomly selected neighbors. This is not a useful query method on its own, but can be used with other methods which require initialization.
random_knn_query(
  query,
  reference,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  order_by_distance = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
query: Matrix of n query items, with observations in the rows and features in the columns.
reference: Matrix of m reference items, with observations in the rows and features in the columns. The neighbors of the query items are selected from this data.
k: Number of nearest neighbors to return.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
order_by_distance: If TRUE (the default), the returned neighbors for each item are sorted by increasing distance.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
an approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
# 100 reference iris items
iris_ref <- iris[iris$Species %in% c("setosa", "versicolor"), ]
# 50 query items
iris_query <- iris[iris$Species == "versicolor", ]

# For each item in iris_query find 4 random neighbors in iris_ref
# If you pass a data frame, non-numeric columns are removed
# set verbose = TRUE to get details on the progress being made
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "euclidean", verbose = TRUE
)

# Manhattan (l1) distance
iris_query_random_nbrs <- random_knn_query(iris_query,
  reference = iris_ref,
  k = 4, metric = "manhattan"
)
This function builds an approximate nearest neighbors graph with convenient defaults, then prepares the index for querying new data, for later use with rnnd_query(). For more control over the process, please see the other functions in the package.
rnnd_build(
  data,
  k = 30,
  metric = "euclidean",
  use_alt_metric = TRUE,
  init = "tree",
  n_trees = NULL,
  leaf_size = NULL,
  max_tree_depth = 200,
  margin = "auto",
  n_iters = NULL,
  delta = 0.001,
  max_candidates = NULL,
  low_memory = TRUE,
  weight_by_degree = FALSE,
  n_search_trees = 1,
  pruning_degree_multiplier = 1.5,
  diversify_prob = 1,
  prune_reverse = FALSE,
  n_threads = 0,
  verbose = FALSE,
  progress = "bar",
  obs = "R"
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to build the index for. You can specify a different number when running rnnd_query().
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
init: Name of the initialization strategy, or initial neighbor data to optimize. "tree" (the default) uses a forest of random projection trees (Dasgupta and Freund, 2008); "rand" uses random initialization. Alternatively, pass a previously calculated neighbor graph in the usual list format.
n_trees: The number of trees to use in the RP forest. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL chooses the number based on the number of observations in data.
leaf_size: The maximum number of items that can appear in a leaf. This value should be chosen to match the expected number of neighbors you will want to retrieve when running queries (e.g. if you want to find 50 nearest neighbors, set leaf_size = 50).
max_tree_depth: The maximum depth of the tree to build (default = 200). If the maximum tree depth is exceeded then the leaf size of a tree may exceed leaf_size.
margin: A character string specifying the method used to assign points to one side of the hyperplane or the other. The default "auto" chooses a method based on the metric. Only used if init = "tree".
n_iters: Number of iterations of nearest neighbor descent to carry out. By default, this will be chosen based on the number of observations in data.
delta: The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. The default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.
max_candidates: Maximum number of candidate neighbors to try for each item in each iteration. Use relative to k to emulate the "rho" sampling parameter of the nearest neighbor descent paper.
low_memory: If TRUE (the default), use a slower but more memory-frugal approach. Setting this to FALSE avoids repeated distance calculations at the cost of more RAM and can give a noticeable speed improvement.
weight_by_degree: If TRUE, weight candidates in the local join by their in-degree, which can improve accuracy when hub nodes are present.
n_search_trees: the number of trees to keep in the search forest as part of index preparation. The default is 1.
pruning_degree_multiplier: How strongly to truncate the final neighbor list for each item. The neighbor list of each item will be truncated to retain only the closest pruning_degree_multiplier * k neighbors.
diversify_prob: the degree of diversification of the search graph by removing unnecessary edges through occlusion pruning. This should take a value between 0 (no diversification) and 1 (remove as many occluded edges as possible, the default).
prune_reverse: If TRUE, also apply occlusion pruning to the reverse graph before merging (default FALSE).
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
progress: Determines the type of progress information logged during the nearest neighbor descent stage when verbose = TRUE. The default "bar" shows a simple text progress bar; "dist" logs the sum of distances in the current graph at each iteration.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
k-nearest neighbor graph construction uses Random Projection Forests (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor Descent (Dong and co-workers, 2011) for refinement. Index preparation uses the graph diversification method of Harwood and Drummond (2016).
the approximate nearest neighbor index, a list containing:
graph: the k-nearest neighbor graph, itself a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
Other list items are intended only for internal use by other functions such as rnnd_query().
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.
Harwood, B., & Drummond, T. (2016). FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]

# Find 4 (approximate) nearest neighbors using Euclidean distance
iris_even_index <- rnnd_build(iris_even, k = 4)
iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
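As noted in the description of k above, the index is not tied to a single neighborhood size, so you can build once and query with a different k. A brief sketch continuing from the example above:

# Reuse the index built with k = 4 to fetch 10 neighbors per query item
iris_odd_nbrs10 <- rnnd_query(index = iris_even_index, query = iris_odd, k = 10)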
This function builds an approximate nearest neighbors graph of the provided data using convenient defaults. It does not return an index for later querying, which speeds up graph construction and reduces the size and complexity of the return value.
rnnd_knn(
  data,
  k = 30,
  metric = "euclidean",
  use_alt_metric = TRUE,
  init = "tree",
  n_trees = NULL,
  leaf_size = NULL,
  max_tree_depth = 200,
  margin = "auto",
  n_iters = NULL,
  delta = 0.001,
  max_candidates = NULL,
  weight_by_degree = FALSE,
  low_memory = TRUE,
  n_threads = 0,
  verbose = FALSE,
  progress = "bar",
  obs = "R"
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return. Optional if init is specified as a neighbor graph.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
init: Name of the initialization strategy, or initial neighbor data to optimize. "tree" (the default) uses a forest of random projection trees (Dasgupta and Freund, 2008); "rand" uses random initialization. Alternatively, pass a previously calculated neighbor graph in the usual list format.
n_trees: The number of trees to use in the RP forest. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL chooses the number based on the number of observations in data.
leaf_size: The maximum number of items that can appear in a leaf. This value should be chosen to match the expected number of neighbors you will want to retrieve when running queries (e.g. if you want to find 50 nearest neighbors, set leaf_size = 50).
max_tree_depth: The maximum depth of the tree to build (default = 200). If the maximum tree depth is exceeded then the leaf size of a tree may exceed leaf_size.
margin: A character string specifying the method used to assign points to one side of the hyperplane or the other. The default "auto" chooses a method based on the metric. Only used if init = "tree".
n_iters: Number of iterations of nearest neighbor descent to carry out. By default, this will be chosen based on the number of observations in data.
delta: The minimum relative change in the neighbor graph allowed before early stopping. Should be a value between 0 and 1. The smaller the value, the smaller the amount of progress between iterations is allowed. The default value of 0.001 means that at least 0.1% of the neighbor graph must be updated at each iteration.
max_candidates: Maximum number of candidate neighbors to try for each item in each iteration. Use relative to k to emulate the "rho" sampling parameter of the nearest neighbor descent paper.
weight_by_degree: If TRUE, weight candidates in the local join by their in-degree, which can improve accuracy when hub nodes are present.
low_memory: If TRUE (the default), use a slower but more memory-frugal approach. Setting this to FALSE avoids repeated distance calculations at the cost of more RAM and can give a noticeable speed improvement.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
progress: Determines the type of progress information logged during the nearest neighbor descent stage when verbose = TRUE. The default "bar" shows a simple text progress bar; "dist" logs the sum of distances in the current graph at each iteration.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
k-nearest neighbor graph construction uses Random Projection Forests (Dasgupta and Freund, 2008) for initialization and Nearest Neighbor Descent (Dong and co-workers, 2011) for refinement. If you are sure you will not want to query new data, then compared to rnnd_build() this function has the advantage of not storing the index, which can be very large.
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World Wide Web (pp. 577-586). ACM. doi:10.1145/1963405.1963487.
# Find 4 (approximate) nearest neighbors using Euclidean distance
iris_knn <- rnnd_knn(iris, k = 4)
Takes a nearest neighbor index produced by rnnd_build() and uses it to find the nearest neighbors of a query set of observations, using a back-tracking search with the search size determined by the method of Iwasaki and Miyazaki (2018). For further control over the search effort, the total number of distance calculations can also be bounded, similar to the method of Harwood and Drummond (2016).
rnnd_query(
  index,
  query,
  k = 30,
  epsilon = 0.1,
  max_search_fraction = 1,
  init = NULL,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
index: A nearest neighbor index produced by rnnd_build().
query: Matrix of n query items, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return.
epsilon: Controls trade-off between accuracy and search cost, as described by Iwasaki and Miyazaki (2018). Setting epsilon to a larger value explores more candidate neighbors, increasing accuracy at the cost of a longer search. The default is 0.1.
max_search_fraction: Maximum fraction of the reference data to search. This is a value between 0 (search none of the reference data) and 1 (search all of the data if necessary). This works in conjunction with epsilon: the search stops as soon as either criterion is met.
init: An optional matrix of initial nearest neighbor indices for the query items; if not provided, the search is initialized from the forest stored in the index.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
Harwood, B., & Drummond, T. (2016). FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5713-5722).
Iwasaki, M., & Miyazaki, D. (2018). Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355. https://arxiv.org/abs/1810.07355
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_even_index <- rnnd_build(iris_even, k = 4)
iris_odd_nbrs <- rnnd_query(index = iris_even_index, query = iris_odd, k = 4)
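Search effort at query time is controlled by epsilon and max_search_fraction. A sketch continuing from the example above (the values are illustrative, not recommendations):

# Explore more of the graph for higher accuracy
iris_odd_nbrs_acc <- rnnd_query(
  index = iris_even_index, query = iris_odd,
  k = 4, epsilon = 0.3
)
# Greedy, bounded search for speed
iris_odd_nbrs_fast <- rnnd_query(
  index = iris_even_index, query = iris_odd,
  k = 4, epsilon = 0, max_search_fraction = 0.1
)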
Builds a "forest" of Random Projection Trees (Dasgupta and Freund, 2008), which can later be searched to find approximate nearest neighbors.
rpf_build(
  data,
  metric = "euclidean",
  use_alt_metric = TRUE,
  n_trees = NULL,
  leaf_size = 10,
  max_tree_depth = 200,
  margin = "auto",
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
data: Matrix of n items to build the forest for, with observations in the rows and features in the columns.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix. Note that if margin = "explicit", the metric is only used to determine whether an angular or Euclidean hyperplane split is used in the tree.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_trees: The number of trees to use in the RP forest. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL chooses the number based on the number of observations in data.
leaf_size: The maximum number of items that can appear in a leaf. This value should be chosen to match the expected number of neighbors you will want to retrieve when running queries (e.g. if you want to find 50 nearest neighbors, set leaf_size = 50).
max_tree_depth: The maximum depth of the tree to build (default = 200). If the maximum tree depth is exceeded then the leaf size of a tree may exceed leaf_size.
margin: A character string specifying the method used to assign points to one side of the hyperplane or the other. The default "auto" chooses a method based on the metric.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
a forest of random projection trees, as a list. Each tree in the forest is a further list, but it is not intended to be examined or manipulated by the user. As a normal R data type, it can be safely serialized and deserialized with base::saveRDS() and base::readRDS(). To use it for querying, pass it as the forest parameter of rpf_knn_query(). The forest does not store any of the data passed in to build the trees, so if you are going to search the forest, you will also need to store the data used to build it and provide it during the search.
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
# Build a forest of 10 trees from the odd rows
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_odd_forest <- rpf_build(iris_odd, n_trees = 10)

iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_even_nn <- rpf_knn_query(
  query = iris_even, reference = iris_odd,
  forest = iris_odd_forest, k = 15
)
Reduce the size of a random projection forest by scoring each tree against a k-nearest neighbors graph. Only the top n_trees trees are retained, which allows for faster querying.
rpf_filter(nn, forest = NULL, n_trees = 1, n_threads = 0, verbose = FALSE)
nn: Nearest neighbor data in the dense list format. This should be derived from the same data that was used to build the forest.
forest: A random partition forest, e.g. created by rpf_build(). If NULL (the default), the forest stored in nn is used (e.g. if nn was created by rpf_knn() with ret_forest = TRUE).
n_trees: The number of trees to retain. By default only the best-scoring tree is retained.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
Trees are scored based on how well each leaf reflects the neighbors specified in the nearest neighbor data. It's best to use nearest neighbor data that is as accurate as possible, and it does not need to come directly from searching the forest: for example, the nearest neighbor data from running nnd_knn() to optimize the neighbor data output from an RP forest is a good choice.

Rather than relying on an RP forest solely for approximate nearest neighbor querying, it is probably more cost-effective to use a small number of trees to initialize the neighbor list for use in a graph search via graph_knn_query().
A forest containing the best-scoring n_trees trees.
# Build a knn with a forest of 10 trees using the odd rows
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
# also return the forest with the knn
rfknn <- rpf_knn(iris_odd, k = 15, n_trees = 10, ret_forest = TRUE)

# keep the best tree (the default):
iris_odd_filtered_forest <- rpf_filter(rfknn)

# get some new data to search
iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]

# search with the filtered forest
iris_even_nn <- rpf_knn_query(
  query = iris_even, reference = iris_odd,
  forest = iris_odd_filtered_forest, k = 15
)
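Following the advice in the Details above, the filtered forest is often better used to initialize a graph search than queried directly. A sketch continuing from the example above:

# Expand the knn into a search graph for querying
iris_odd_search_graph <- prepare_search_graph(iris_odd, rfknn)
# Use the single filtered tree to initialize the beam search
iris_even_nn2 <- graph_knn_query(
  query = iris_even, reference = iris_odd,
  reference_graph = iris_odd_search_graph,
  init = iris_odd_filtered_forest, k = 15
)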
Returns the approximate k-nearest neighbor graph of a dataset by searching multiple random projection trees, a variant of k-d trees introduced by Dasgupta and Freund (2008).
rpf_knn(
  data,
  k,
  metric = "euclidean",
  use_alt_metric = TRUE,
  n_trees = NULL,
  leaf_size = NULL,
  max_tree_depth = 200,
  include_self = TRUE,
  ret_forest = FALSE,
  margin = "auto",
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
data: Matrix of n items to generate neighbors for, with observations in the rows and features in the columns.
k: Number of nearest neighbors to return.
metric: Type of distance calculation to use, e.g. "euclidean" (the default). For non-sparse data, preprocessed variants of some metrics are available: this trades memory for a potential speed up during the distance calculation, and some minor numerical differences should be expected compared to the non-preprocessed versions. Specialized metrics are also available for non-sparse binary data passed as a logical matrix. Note that if margin = "explicit", the metric is only used to determine whether an angular or Euclidean hyperplane split is used in the tree.
use_alt_metric: If TRUE, use faster metrics that maintain the ordering of distances internally (e.g. squared Euclidean distances when metric = "euclidean"), then apply a correction at the end.
n_trees: The number of trees to use in the RP forest. A larger number will give more accurate results at the cost of a longer computation time. The default of NULL chooses the number based on the number of observations in data.
leaf_size: The maximum number of items that can appear in a leaf. The default of NULL chooses the value based on k.
max_tree_depth: The maximum depth of the tree to build (default = 200). If the maximum tree depth is exceeded then the leaf size of a tree may exceed leaf_size.
include_self: If TRUE (the default), an item is considered to be its own nearest neighbor.
ret_forest: If TRUE, also return the RP forest that generated the neighbor graph, which can be used to query new data (see the Value section).
margin: A character string specifying the method used to assign points to one side of the hyperplane or the other. The default "auto" chooses a method based on the metric.
n_threads: Number of threads to use.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
forest: (if ret_forest = TRUE) the RP forest that generated the neighbor graph, which can be used to query new data.
k neighbors per observation are not guaranteed to be found. Missing data is represented with an index of 0 and a distance of NA.
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
# Find 4 (approximate) nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- rpf_knn(iris, k = 4, metric = "euclidean", leaf_size = 3)

# If you want to initialize another method (e.g. nearest neighbor descent)
# with the result of the RP forest, then it's more efficient to skip
# evaluating whether an item is a neighbor of itself by setting
# `include_self = FALSE`:
iris_rp <- rpf_knn(iris, k = 4, n_trees = 3, include_self = FALSE)

# for future querying you may want to also return the RP forest:
iris_rpf <- rpf_knn(iris,
  k = 4, n_trees = 3, include_self = FALSE,
  ret_forest = TRUE
)
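Because k neighbors are not guaranteed (see the note above), it is worth checking the result for missing entries before using it downstream. A small sketch:

iris_rp_chk <- rpf_knn(iris, k = 4, n_trees = 3)
# rows with at least one missing neighbor (index 0, distance NA)
which(rowSums(iris_rp_chk$idx == 0) > 0)
anyNA(iris_rp_chk$dist)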
Run queries against a "forest" of Random Projection Trees (Dasgupta and Freund, 2008), to return nearest neighbors taken from the reference data used to build the forest.
rpf_knn_query(
  query,
  reference,
  forest,
  k,
  cache = TRUE,
  n_threads = 0,
  verbose = FALSE,
  obs = "R"
)
query: Matrix of n query items, with observations in the rows and features in the columns.
reference: Matrix of m reference items used to build the forest, with observations in the rows and features in the columns.
forest: A random partition forest, created by rpf_build() (or returned from rpf_knn() with ret_forest = TRUE).
k: Number of nearest neighbors to return. You are unlikely to get good results if you choose a value substantially larger than the value of leaf_size used to build the forest.
cache: if TRUE (the default), cache distance calculations so that items appearing in the leaves of multiple trees are not compared repeatedly.
n_threads: Number of threads to use. Note that the parallelism in the search is done over the observations in query, not the trees in the forest.
verbose: If TRUE, log information about progress to the console.
obs: set to "C" to indicate that the input data is oriented with observations as columns; the default "R" indicates observations as rows.
the approximate nearest neighbor graph as a list containing:
idx: an n by k matrix containing the nearest neighbor indices.
dist: an n by k matrix containing the nearest neighbor distances.
k neighbors per observation are not guaranteed to be found. Missing data is represented with an index of 0 and a distance of NA.
Dasgupta, S., & Freund, Y. (2008, May). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing (pp. 537-546). doi:10.1145/1374376.1374452.
# Build a forest of 10 trees from the odd rows
iris_odd <- iris[seq_len(nrow(iris)) %% 2 == 1, ]
iris_odd_forest <- rpf_build(iris_odd, n_trees = 10)

iris_even <- iris[seq_len(nrow(iris)) %% 2 == 0, ]
iris_even_nn <- rpf_knn_query(
  query = iris_even, reference = iris_odd,
  forest = iris_odd_forest, k = 15
)