"cosine"
, "jaccard"
or "hellinger"
metrics
could give incorrect results if maximally-dissimilar items were in the nearest
neighbors. Thank you to Maciej Beręsewicz for the
report (https://github.com/jlmelville/rnndescent/issues/14).nnd_knn
and rnnd_build
: weight_by_degree
. If set
to TRUE
, then the candidate list in nearest neighbor descent is weighted in
favor of low-degree items, which should make for a more diverse local join.
There is a minor increase in computation but also a minor increase in accuracy.rnnd_build
generated an error when preparing the search graph for some
metrics (notably cosine
and jaccard
).prepare_search_graph
: use_alt_metric
. This behaves like
the existing use_alt_metric
parameters in other functions and may speed up
the index preparation step in e.g. rnnd_build
.rnnd_build
and prepare_search_graph
: prune_reverse
. If
set to TRUE
the reverse graph will be pruned using the
pruning_degree_multiplier
parameter before any diversification. This can help
prevent an excessive amount of time being spent in the diversification step in
the case where an item has a large number of neighbors (in the reverse graph
this can be as large as the number of items in the dataset). Pruning of the
merged graph still occurs, so this is an additional pruning step. This should
have little effect on search results, but for backwards compatibility, the
default is FALSE
.CRAN resubmission to fix lingering UBSAN errors.
Initial CRAN submission.
rnnd_build
now always prepares the search graph.rnnd_prepare
function has been removed. The option to not prepare
the search graph during index building only made sense if you were only
interested in the k-nearest neighbor graph. Now that rnnd_knn
exists for
that purpose (see below), the logic of index building has been substantially
simplified.nn_to_sparse
function has been removed.merge_knn
function has been removed, and merge_knnl
has been renamed
to merge_knn
. If you were running e.g. merge_knn(nn1, nn2)
, you must now
use merge_knn(list(nn1, nn2))
. Also the parameter nn_graphs
has been
renamed graphs
.rnnd_knn
. Behaves a lot like rnnd_build
, but only returns
the knn graph with no index built. The index can be very large in size for
high dimensional or large datasets, so this function is useful if you only
care about the knn graph and won't ever want to query new data.neighbor_overlap
. Measures the overlap of two knn graphs via
their shared indices. A similar function was used extensively in some vignettes
so it may have sufficient utility to be useful to others.rnnd_query
and graph_knn_query
: max_search_fraction
.
This parameter controls the maximum number of nodes that can be searched during
querying. If the number of nodes searched exceeds this fraction of the total
number of nodes in the graph, the search will be terminated. This can be
used in combination with epsilon
to avoid excessive search times.spearmanr
distance has been fixed.n_threads = 0
, progress/interrupt monitoring was
not occurring.init
parameter of rnnd_query
.rnnd_query
: if verbose = TRUE
, a summary of the min, max and average
number of distance queries will be logged. This can help tune epsilon
and
max_search_fraction
.local_scale_nn
has been removed, for similar reasons to the removal
of the standalone distance functions. It remains in the localscale
branch
of the github repo.prepare_search_graph
is now transposed. This
prevents having to repeatedly transpose inside every call to graph_knn_query
if multiple queries are being made. You will need to either regenerate any
saved search graphs or transpose them with Matrix::t(search_graph)
.rnnd_build
, rnnd_query
and rnnd_prepare
. These functions
streamline the process of building a k-nearest neighbor graph, preparing a
search graph and querying it.The bhamming
metric no longer exists as a specialized metric. Instead, if
you pass a logical
matrix to data
, reference
or query
parameter
(depending on the function) and specify metric = "hamming"
you will
automatically get the binary-specific version of the hamming metric.
The hamming
and bhamming
metrics are now normalized with respect to the
number of features, to be consistent with the other binary-style metrics (and
PyNNDescent). If you need the old distances, multiply the distance matrix by
the number of columns, e.g. do something like:
res <- brute_force_knn(X, metric = "hamming")
res$dist <- res$dist * ncol(X)
The metric l2sqr
has been renamed sqeuclidean
to be consistent with
PyNNDescent.
metric
parameter now accepts a much larger number
of metrics. See the rdoc for the full list of supported metrics. Currently, most
of the metrics from PyNNDescent which don't require extra parameters are
supported. The number of specialized binary metrics has also been expanded.rpf_knn
and rpf_build
: max_tree_depth
this controls
the depth of the tree and was set to 100 internally. This default has been
doubled to 200 and can now be user-controlled. If verbose = TRUE
and the
largest leaf in the forest exceeds the leaf_size
parameter, a message warning
you about this will be logged and indicates that the maximum tree depth has
been exceeded. Increasing max_tree_depth
may not be the answer: it's more
likely there is something unusual about the distribution of the distances in
your dataset and a random initialization might be a better use of your time.dgCMatrix
to the data
, reference
or
query
parameters where you would usually use a dense matrix or data frame.
cosine
, euclidean
, manhattan
, hamming
and correlation
are all
available, but alternative versions in the dense case, e.g. cosine-preprocess
or the binary-specific bhamming
for dense data is not.init
option for graph_knn_query
: you can now pass an RP forest and
initialize with that, e.g. from rpf_build
, or by setting ret_forest = TRUE
on nnd_knn
or rpf_knn
. You may want to cut down the size of the forest
used for initialization with rpf_filter
first, though (a single tree may be
enough). This will also use the metric data in the forest, so setting metric
(or use_alt_metric
) in the function itself will be ignored.prepare_search_graph
or to graph_knn_query
contains missing data, this will no longer cause an error (it still might not be
the best idea though).rpf_knn
. Calculates the approximate k-nearest neighbors using
a random partition forest.rpf_build
. Builds a random partition forest.rpf_knn_query
. Queries a random partition forest (built with
rpf_build
to find the approximate nearest neighbors for the query points.rpf_filter
. Retains only the best "scoring" trees in a forest,
where each tree is scored based on how well it reproduces a given knn.nnd_knn
: init = "tree"
. Uses the RP Forest
initialization method.nnd_knn
: ret_forest
. Returns the search forest used if
init = "tree"
so it can be used for future searching or filtering.nnd_knn
: init_opts
. Options that can be passed to the
RP forest initialization (same as in rpf_knn
).nnd_knn
with n_threads > 0
was reporting double the
actual number of iterations. This made the progress bar way too optimistic.metric
: "cosine"
and "correlation"
have been renamed
"cosine-preprocess"
and "correlation-preprocess"
respectively. This
reflects that they do some preprocessing of the data up front to make
subsequent distance calculations faster. I have endeavored to avoid unnecessary
allocations or copying in this preprocessing, but there is still a chance of
more memory usage.cosine
and correlation
metrics are still available as an option, but
now use an implementation that doesn't do any preprocessing. The preprocessing
and non-preprocessing version should give the same numerical results, give or
take some minor numerical differences, but when the distance should be zero,
the preprocessing versions may give values which are slightly different from
zero (e.g. 1e-7).correlation_distance
, cosine_distance
,
euclidean_distance
, hamming_distance
, l2sqr_distance
, manhattan_distance
for calculating the distance between two vectors, which may be useful for
more arbitrary distance calculations than the nearest neighbor routines here,
although they won't be as efficient (they do call the same C++ code, though).
The cosine and correlation calculations here use the non-preprocessing
implementations.hamming
metric to a standard definition. The old implementation
of hamming
metric which worked on binary data only was renamed into
bhamming
. (contributed by Vitalie Spinu)obs
has been added to most functions: set obs = "C"
and you
can pass the input data in column-oriented format.random_knn
function used to always return each item as its own neighbor,
so that only n_nbrs - 1
of the returned neighbors were actually selected at
random. Even I forgot it did that and it doesn't make a lot of sense, so now you
really do just get back n_nbrs
random selections.init
parameter to nnd_knn
or
graph_knn_query
: previously, if k
was specified and larger than the number
of neighbors included in init
, this gave an error. Now, init
will be
augmented with random neighbors to reach the desired k
. This could be useful
as a way to "restart" a neighbor search from a better-than-random location if
k
has been found to have been too small initially. Note that the random
selection does not take into account the identities of the already chosen
neighbors, so duplicates may be included in the augmented result, which will
reduce the effective size of the initialized number of neighbors.block_size
and grain_size
parameters from functions. These
were related to the amount of work done per thread, but it's not obvious to
an outside user how to set these.verbose = TRUE
) and respond to user-requested cancellation.nnd_knn_query
has been renamed to graph_knn_query
and now more closely
follows the current pynndescent graph search method (including backtracking
search).prepare_search_graph
for preparing a search graph from a
neighbor graph for use in graph_knn_query
, by using reverse nearest neighbors,
occlusion pruning and truncation.graph_knn_query
.There was a major rewrite of the internal organization of the C++ to be less R-specific.
The license for rnndescent has changed from GPLv3 to GPLv3 or later.
"correlation"
. This is (1 minus) the Pearson correlation.k_occur
which counts the k-occurrences of each item in the
idx
matrix, which is the number of times an item appears in the k-nearest
neighbor list in the dataset. The distribution of the k-occurrences can be
used to diagnose the "hubness" of a dataset. Items with a large k-occurrence
(>> k, e.g. 10k), may indicate low accuracy of the approximate nearest neighbor
result.To avoid undefined behavior issues, rnndescent now uses an internal
implementation of RcppParallel's parallelFor
loop that works with
std::thread
and does not load Intel's TBB library.
dqrng
sample routines
from inside a thread, despite it clearly using the R API extensively. It's not
ok and causes lots of crashes. There is now a re-implementation of dqrng
's
sample routines using plain std::vector
s in src/rnn_sample.h
. That file is
licensed under the AGPL (rnndescent
as a whole remains GPL3).merge_knn
, to combine two nearest neighbor graphs. Useful for
combining the results of multiple runs of nnd_knn
or random_knn
. Also,
merge_knnl
, which operates on a list of multiple neighbor graphs, and can
provide a speed up over merge_knn
if you don't mind storing multiple graphs
in memory at once.nnd_knn
with n_threads > 1
and random_knn
with n_threads > 1
and order_by_distance = TRUE
.nnd_knn
with n_threads > 1
due to
the use of a mutex pool.Mainly an internal clean-up to reduce duplication.
nnd_knn
and nnd_knn_query
use the same progress bar as
the brute force and random neighbor functions. Bring back the old per-iteration
logging that also showed the current distance sum of the knn with the
progress = "dist"
option.random_knn
and random_knn_query
, when order_by_distance = TRUE
and
n_threads > 0
, the final sorting of the knn graph will be multi-threaded.n_threads > 0
.nnd_knn_query
being
the most useful, but brute_force_knn_query
and random_knn_query
are also
available. This allows for query
data to search reference
data, i.e. the
returned indices and distances are relative to the reference
data, not any
other member of query
. These methods are also available in multi-threaded
mode, and nnd_knn_query
has a low and high memory version.l2
metric has been renamed to l2sqr
to more accurately reflect what
it is: the square of the L2 (Euclidean) metric.use_alt_metric
. Set to FALSE
if you don't want alternative,
faster metrics (which keep the distance ordering of metric
) to be used in
internal calculations. Currently only applies to metric = "euclidean"
, where
the squared Euclidean distance is used internally. Only worth setting this to
FALSE
if you think the alternative is causing numerical issues (which is
a bug, so please report it!).block_size
for parallel methods, which determines the amount
of work done in parallel before checking for user interrupt request and updating
any progress.random_knn
now returns its results in sorted order. You can turn this off
with order_distances = FALSE
, if you don't need the sorting (e.g. you are
using the results as input to something else).brute_force
and random
methods should now be
correct.brute_force_knn
.random_knn
.verbose = TRUE
.fast_rand
option has been removed, as it only applied to single-threading,
and had a negligible effect.Also, a number of changes inspired by recent work in https://github.com/lmcinnes/pynndescent:
rho
sampling parameter has been removed. The size of the candidates
(general neighbors) list is now controlled entirely by max_candidates
.max_candidates
has been reduced to 20.use_set
logical flag has been replaced by low_memory
, which has the
opposite meaning. It now also works when using multiple threads. While it
follows the pynndescent implementation, it's still experimental, so
low_memory = TRUE
by default for the moment.low_memory = FALSE
implementation for n_threads = 0
(originally
equivalent to use_set = TRUE
) is faster.block_size
, which balances interleaving of queuing updates
versus applying them to the current graph.