--- title: "Hubness" output: rmarkdown::html_vignette: fig_width: 4 fig_height: 4 vignette: > %\VignetteIndexEntry{Hubness} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibliography.bibtex --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(rnndescent) ``` Nearest Neighbor Descent (NND) [@dong2011efficient] is affected by hubness [@bratic2019influence]: this is when some items in a dataset appear as a near neighbor of other points very frequently. This can result in reduced accuracy of the approximate nearest neighbor graph produced by NND and may be an intrinsic problem in high dimensional datasets (although see [@low2013hubness] for a dissenting view). In this vignette we will use synthetic data to explore the issue, and use the k-occurrences of a neighbor graph to identify when NND is at risk of producing less accurate results. We will also look at some ways to ameliorate the effects of hubness. This will make use of some utility functions in `rnndescent` which can be useful for assessing the accuracy of the approximate nearest neighbors: `k_occur`, `neighbor_overlap`, and `merge_knn`. First, to control the pseudo-random number generation: ```{r set seed} set.seed(42) ``` Now let's create some Gaussian data to test with. First, a low-dimensional example: ```{r 2D Gaussian} n_points <- 1000 low_dim <- 2 g2d <- matrix(rnorm(n_points * low_dim), ncol = low_dim) ``` In this vignette, we are interested in the 15 nearest neighbors. To get the exact nearest neighbors, we use the `brute_force_knn` function with `k = 15`: ```{r brute force 2D results} g2d_nnbf <- brute_force_knn(g2d, k = 15, metric = "euclidean") ``` This will act as our "ground truth" and we will compare how well NND does. To use NND to find the approximate nearest neighbors, we use the `nnd_knn` function: ```{r nearest neighbor descent 2D results} g2d_nnd <- nnd_knn(g2d, k = 15, metric = "euclidean") ``` To calculate the accuracy of NND we will use the `neighbor_overlap` function to find the mean overlap of the nearest neighbor indices produced by the methods. `0` means that none of the exact 15-nearest neighbors were in the list of 15-nearest neighbors that NND found, and `1` means they all were. With low-dimensional data, nearest neighbor descent does very well: ```{r 2D nnd accuracy} neighbor_overlap(g2d_nnbf, g2d_nnd, k = 15) ``` Now let's see what happens with a high-dimensional (1000 features): ```{r 1000D Gaussian} hi_dim <- 1000 g1000d <- matrix(rnorm(n_points * hi_dim), ncol = hi_dim) ``` Again we will use brute force to generate the true nearest neighbors. ```{r brute force 1000D results} g1000d_nnbf <- brute_force_knn(g1000d, k = 15, metric = "euclidean") ``` Let's do NND on the high dimensional data... ```{r nearest neighbor descent 1000D results} g1000d_nnd <- nnd_knn(g1000d, k = 15, metric = "euclidean") ``` ...and see how well it does: ```{r 1000D nnd accuracy} neighbor_overlap(g1000d_nnbf, g1000d_nnd, k = 15) ``` Still OK, but not as good as we might like. ## Comparing Low- and High-Dimensional Nearest Neighbors Let's look at the distribution of nearest neighbor distances in high and low dimensions (for easier comparison, I have normalized them with respect to the largest distance) ```{r NN distances distribution} hist(g2d_nnbf$dist[, -1] / max(g2d_nnbf$dist[, -1]), xlab = "distances", main = "2D 15-NN") hist(g1000d_nnbf$dist[, -1] / max(g1000d_nnbf$dist[, -1]), xlab = "distances", main = "1000D 15-NN") ``` Compared to low dimensional data, we can see that the high dimensional distances are distributed around a higher distance as well as more symmetric in their distribution. Here are the distribution of the neighbor distances in the high-dimensional case for the neighbors found by NND: ```{r NND distances distribution} hist(g1000d_nnd$dist[, -1] / max(g1000d_nnd$dist[, -1]), xlab = "distances", main = "1000D 15-NND" ) ``` Pretty much indistinguishable from the exact results, so it seems like there isn't an obvious diagnostic from the distances themselves. Is the distribution of the errors in the NND results uniform or are some items neighborhoods noticeably better predicted than others? This function will calculate a vector of the relative RMS error between two sets of neighbors in terms of distances: ```{r distance relative RMS error} nn_rrmsev <- function(nn, ref) { n <- ncol(ref$dist) - 1 sqrt(apply((nn$dist[, -1] - ref$dist[, -1])^2 / n, 1, sum) / apply(ref$dist[, -1]^2, 1, sum)) } ``` We don't include the first nearest neighbor distances, as these are invariably the self distances which leads to an uninteresting number of zero error results. This measure of error is a bit less strict than `neighbor_overlap` as a neighbor that is outside the true kNN, but which has a comparable distance, will be penalized a less harshly than a more distant point. Here's a histogram of RRMS distance errors: ```{r histogram of distance difference} g1000d_rrmse <- nn_rrmsev(g1000d_nnd, g1000d_nnbf) hist(g1000d_rrmse, main = "1000D distance error", xlab = "Relative RMS error" ) ``` None of the relative errors are actually that large, so if you only care about the value of kth nearest neighbor distances, then even in the 1000D case, we still get decent results in this case. We can also see that there is a clear distribution of errors, where an appreciable number of items have zero RRMS distance errors, but there a few items which have the largest error. We can also get out the overlap on a per-item basis. If you set `ret_vec = TRUE`, `neighbor_overlap` will now return a list with the individual overlaps as the `overlaps` vector (the overall mean average overlap is returned as the `mean` item). Here is a histograms of the accuracies: ```{r hist 1000D} g1000d_nnd_acc <- neighbor_overlap(g1000d_nnbf, g1000d_nnd, k = 15, ret_vec = TRUE)$overlaps hist(g1000d_nnd_acc, main = "1000D accuracy", xlab = "accuracy", xlim = c(0, 1) ) ``` This shows a similar pattern: some items have very high accuracy and some have noticeably worse accuracy than average. For completeness, here is the relationship between accuracy and RRMSE: ```{r rrmse vs acc} plot( g1000d_nnd_acc, g1000d_rrmse, main = "RRMSE vs accuracy", xlab = "accuracy", ylab = "RRMSE" ) ``` Nothing very surprising here: there's a fairly consistent spread of RRMSE for a given accuracy. So whether you use an error in the distance or accuracy to measure how well the approximate nearest neighbors method is working, at least in this case, a high dimensional dataset affects some items more than others. ## Detecting Hubness [@radovanovic2010hubs] discusses a technique for detecting hubness: look for items that appear very frequently in the k-nearest neighbor graph. The `k_occur` function counts "k-occurrences" of each item in a dataset, i.e. a count of the number of times an item appears in the k-nearest neighbor graph. You can also see it as reversing the direction of the edges in the k-nearest neighbor graph and then counting the in-degree of each item. If the distribution of neighbors was entirely uniform we would expect to see each item appear $k$ times. If there are hubs then the k-occurrence could get as large as the size of the dataset, $N$. An item which appears in the neighbor graph fewer than $k$ times could be termed an "antihub". Our definition of a neighbor of an item always includes the item itself, so we would expect the minimum $k$-occurrence to be $1$. Also, because there are always only $Nk$ edges in a $k$-nearest neighbor graph, if an item appears more than the expected amount this implies that other items must be under-represented. Practically speaking, there are always going to be items with a larger $k$-occurrence than expected and hence some with a lower $k$-occurrence, so hubness or anti-hubness is more a case of deciding on a cut-off after which the presence of an item with a lot of neighbors starts causing you problems, which is going to be dependent on what you are planning to do with the neighbor graph (and probably the number of neighbors you want). ### k-occurrence in the 2D case First, let's look at the 2D case using the exact k-nearest neighbors: ```{r 2D k-occurrences} g2d_bfko <- k_occur(g2d_nnbf, k = 15) summary(g2d_bfko) ``` The mean average of the $k$-occurrence is never helpful: as noted above there are always $Nk$ edges in the neighbor graph, so the mean $k$-occurrence is always $k$. However the other descriptions of the distribution are informative. The median $k$-occurrence is also `15`, which is a good sign, and the values at 25% and 75% aren't too different other. The maximum $k$-occurrence is less than $2k$. The minimum value is `1` which means there are anti-hubs in the dataset, but: ```{r number of anti-hubs in 2D case} sum(g2d_bfko == 1) ``` there is only one anti-hub in this dataset. Here's a histogram of the k-occurrences: ```{r 2D k-occurrence histogram} hist(g2d_bfko, main = "2D 15-NN", xlab = "k-occurrences") ``` This unremarkable-looking distribution is a visual indication of a dataset without a lot of hubness and anti-hubs lurking to cause problems for nearest neighbor descent. ### k-occurrence in the 1000D case Here's what the k-occurrence histogram looks like for the high dimensional case: ```{r 1000D k-occurrences} g1000d_bfko <- k_occur(g1000d_nnbf$idx, k = 15) hist(g1000d_bfko, main = "1000D 15-NN", xlab = "k-occurrences") ``` The differences are pretty stark. The first thing to notice is the x-axis. In the 2D case, the maximum k-occurrence was ~20. For the 1000D we are looking at ~300. It's hard to see any details, so let's zoom in on the same region as the 2D case by clipping any k-occurrence larger than the largest 2D k-occurrence: ```{r zoomed k-occurrences} hist(pmin(g1000d_bfko, max(g2d_bfko)), main = "1000D 15-NN zoomed", xlab = "k-occurrences" ) ``` It's a very different distribution to the 2D case: we have a large number of anti-hubs and a noticeable number of hubs. There's certainly no peak at a k-occurrence of 15. Comparing the numerical summary with the 2D case is instructive: ```{r numerical summary of 1000D k-occurrence distribution} summary(g1000d_bfko) ``` Again, here's a good reminder that the mean k-occurrence is of no value. The median k-occurrence immediately communicates the difference between the 2D case. We can also see that the maximum k-occurrence means that there is one point which is considered a close neighbor of over one third of the dataset. How many anti-hubs are there? ```{r number of anti-hubs in the 1000D case} sum(g1000d_bfko == 1) ``` A quarter of the dataset does not appear as a neighbor of any other point. This has serious implications for using a neighbor graph for certain purposes: you cannot reach a quarter of the dataset by starting at an arbitrary point in the graph and following neighbors. This also might point to why nearest neighbor descent has trouble with this high dimensional case: if we rely on points turning up as a neighbors of other points in order to introduce them to potential neighbors, the fact that so many of the points in this dataset aren't anyone's actual neighbors would suggest they are unlikely to get involved in the local join procedure as much as other points. ### k-occurrence as a diagnostic of NND failure We have now shown that we can use k-occurrences on the exact nearest neighbors of low and high dimensional data to detect the existence of hubs, which in turn might lead us to suspect that the approximate nearest neighbors found by nearest neighbor descent may not be very accurate. But that's not a very useful diagnostic because if we have the exact neighbors we don't need to run NND in the first place. But even if the approximate nearest neighbor graph produced by NND isn't highly accurate, does it still show similar characteristics of hubness? ```{r 1000D NND k-occurences} g1000d_nndko <- k_occur(g1000d_nnd$idx, k = 15) hist(g1000d_nndko, main = "1000D 15-NND", xlab = "k-occurrences") ``` That seems similar to the true results, and zooming in like we did with the exact results: ```{r zoomed NND k-occurrences} hist(pmin(g1000d_nndko, max(g2d_bfko)), main = "1000D 15-NND zoomed", xlab = "k-occurrences" ) ``` Visually this looks a lot like the distribution of the exact results. Next, the numerical summary: ```{r 1000D NND k-occurences numeric summary} summary(g1000d_nndko) sum(g1000d_nndko == 1) ``` Quantitatively, this also tracks the exact results: the median k-occurrence is much smaller than $k$, there is a hub with a very large number of neighbors (larger than in the exact case but to a similar degree) and a similar number of anti-hubs. So this suggests a way to diagnose if the nearest neighbor descent routine may have low accuracy: look at the distribution of the k-occurrences of the resulting approximate nearest neighbor graph (or even just the maximum value). A value that is $\gg k$ may mean a reduced accuracy. Of course, this isn't foolproof, because even if NND did a perfect job then we would still get these sorts of values, but it's a starting point. Taking the distribution of k-occurrences as a whole, the approximate results seem to track the exact results fairly well, but as we have seen, the errors in the approximate results are not uniformly distributed across the data. So let's see how well the NND k-occurrences "predict" the exact results: ```{r approximate vs true 1000D k-occurrence} plot(g1000d_nndko, g1000d_bfko, xlab = "approximate", ylab = "exact", xlim = c(0, max(g1000d_nndko, g1000d_bfko)), ylim = c(0, max(g1000d_nndko, g1000d_bfko)), main = "1000D k-occ" ) abline(a = 0, b = 1) cor(g1000d_nndko, g1000d_bfko, method = "pearson") ``` The overall relationship seems strong. The line on the plot is x=y, so we can see that at high values of the k-occurrence the NND results tend to over-estimate the k-occurrence, but these are such large values that this hardly matters, and there is no ambiguity over which nodes are most hub-like. Zooming in to lower values of the k-occurrence: ```{r zoomed approximate vs true 1000D k-occurrence} plot(g1000d_nndko, g1000d_bfko, xlab = "approximate", ylab = "exact", xlim = c(0, max(g2d_bfko)), ylim = c(0, max(g2d_bfko)), main = "1000D low k-occ" ) abline(a = 0, b = 1) ``` here it seems that there is a tendency to over-estimate the k-occurrence. Anti-hubs are also not perfectly identified, but there are no true anti-hubs which appear more than a small number of times in the approximate neighbor graph. ### Detecting Poorly Predicted Neighbors We've seen that some objects have their neighbors predicted better than others. Based on everything we've seen so far about k-occurrences and NND, it would be reasonable to wonder: are the items in a dataset with poorly predicted neighbors the anti-hubs (predicted or exact)? This would at least give us some way of detecting those items that were likely to have low accuracy neighborhoods: perhaps they could be treated specially (or by some other algorithm). Here's a plot of the accuracy against the k-occurrences of the NND neighbors: ```{r predicting accuracy with NND k-occurrence} plot(g1000d_nndko, g1000d_nnd_acc, xlab = "NND k-occ", ylab = "accuracy", xlim = c(0, max(g1000d_nndko, g1000d_bfko)), main = "1000D acc vs NND k-occ" ) ``` So the answer to the question is "not really", but there *is* a trend. The empty space in the lower right of the plot indicates that items with a large k-occurrence (hubs) are very well predicted. And above a k-occurrence of 150, we are guaranteed to perfectly predict the neighborhood of an item. However, at the other end of the k-occurrence spectrum, we can see that while the lower bound on the predicted accuracy does plummet as the k-occurrence is reduced, some anti-hubs actually do have their neighborhoods very accurately predicted too. Unfortunately this means that k-occurrence is a bit too rough to use to predict poorly-predicted items. Let's say that we wanted to get all the items where the neighborhood was less than 90% accurate: ```{r proportion of items with < 90% accuracy} sum(g1000d_nnd_acc < 0.9) ``` That's already quite a lot of items: about three-quarters of the entire dataset. What is the largest k-occurrence for an item in the dataset with that accuracy threshold? ```{r max k-occurrence for lower accuracy items} max(g1000d_nndko[g1000d_nnd_acc < 0.9]) ``` Then, to guarantee that we had found all the items that might be poorly predicted, we would need to filter out every item that had a k-occurrence smaller than that value, even though we know that some of them are well-predicted: ```{r how many items} sum(g1000d_nndko <= max(g1000d_nndko[g1000d_nnd_acc < 0.9])) ``` That's most of the dataset. If we dropped the threshold to 80 accuracy, does it help? ```{r how many items at a lower accuracy thresold} sum(g1000d_nndko <= max(g1000d_nndko[g1000d_nnd_acc < 0.8])) ``` A bit, but it's still a substantial majority of the dataset. So whatever we decided to do with these items we wouldn't be saving a huge amount of effort. So much for that idea. What this suggests is not that we *can't* improve results here, just that the effort of identifying individual points to filter out, treat differently and then merge back into the final neighbor graph means that just reprocessing the entire dataset in a different way is likely to be a competitive solution. ### Detecting Problems Early Back to looking at the k-occurrence distribution as a whole: we can see that the converged NND results, despite not being 100% accurate do a good job at expressing the hubness of the underlying data. How converged do the results need to be? What if we think of NND as a tool for identifying hubness in datasets as a whole rather than for accurate approximate nearest neighbor graphs? Could a much less unconverged NND graph, while obviously being even less accurate, still correctly identify a dataset as having hubs? To test this, let's run the NND method for only one iteration and get the k-occurrences that result: ```{r unconverged NND} g1000d_nnd_iter1 <- nnd_knn(g1000d, k = 15, metric = "euclidean", n_iters = 1) g1000d_nndkoi1 <- k_occur(g1000d_nnd_iter1$idx, k = 15) ``` How accurate are these results? ```{r unconverged NND accuracy} neighbor_overlap(g1000d_nnbf, g1000d_nnd_iter1, k = 15) ``` Ok, I think we can all agree we do *not* have an accurate neighbor graph. But let's take a look at the k-occurrence distribution: ```{r unconverged ko distribution} hist(g1000d_nndkoi1, main = "1000D 15-NND (1 iter)", xlab = "k-occurrences") ``` Looking familiar. Zooming in... ```{r unconverged ko distribution zoomed} hist(pmin(g1000d_nndkoi1, max(g2d_bfko)), main = "1000D 15-NND (1 iter, zoomed)", xlab = "k-occurrences" ) ``` The distribution is at least similar to the converged version. Taking a look at some numbers: ```{r unconverged ko distribution numerical summary} summary(g1000d_nndkoi1) sum(g1000d_nndkoi1 == 1) ``` Compared to the converged (or exact) distribution, the median k-occurrence is not as low, the object with the largest k-occurrence, while large ($> 10k$, which seems like a good threshold to be concerned about the presence of hubs) is not as large, and there are fewer objects which are anti-hubs. At least for this dataset, hubness can be qualitatively detected with even a very inaccurate neighbor graph. What about datasets that don't contain hubs? Let's just check that what we are seeing is not an artifact of unconverged nearest neighbor descent, by running through the same procedure with the 2D dataset: ```{r unconverged 2D NND} g2d_nnd_iter1 <- nnd_knn(g2d, k = 15, metric = "euclidean", n_iters = 1) g2d_nndkoi1 <- k_occur(g2d_nnd_iter1$idx, k = 15) hist(g2d_nndkoi1, main = "2D 15-NND (1 iter)", xlab = "k-occurrences") summary(g2d_nndkoi1) sum(g2d_nndkoi1 == 1) neighbor_overlap(g2d_nnbf, g2d_nnd_iter1, k = 15) ``` We can see that the neighbor graph is also not very accurate after 1 iteration in the 2D case, but the distribution of k-occurrences also qualitatively resembles the exact result. This time, compared to the exact results there are slightly more anti-hubs and the maximum k-occurrence is increased, so the trends are slightly reversed compared to the 1000D data. For at least qualitative identification of hubness, then, one iteration of nearest neighbor descent might be enough. ## Improving accuracy We know that nearest neighbor descent (at least with typical settings) may not give highly accurate results in high dimensions. And with the help of k-occurrences, we can even detect that it might be happening. But what can we do about it? ### Weight candidates by degree When creating the local join list for each object, we only have space for `max_candidates` candidates. If there's more than that, then the candidates are randomly chosen. But with a uniform random weighting, low-degree items could get crowded out of over-subscribed candidate lists, but don't get an opportunity to appear in many other lists. Hub items, on the other hand, are going to appear on multiple candidate lists, so even if they get crowded out of a few lists, they are still going to be involved in the local join. So what if, instead of weighting the candidates randomly but uniformly, we weight the random selection by the degree of the candidate? A hub with a large degree will tend to have a larger weight, and due to how `rnndescent`'s internal heap implementation works, will make it more likely to be evicted from the heap. Conversely, a low-degree item will tend to have a lower weight. This should give a more equitable distribution of candidates. However, there is a small cost to doing this. At each iteration, we need to calculate and store the k-occurrence of each item in the current state of the graph. Honestly, this isn't a huge amount of work, but as a result `weight_by_degree = FALSE` by default. For the high-dimensional case, it's worth trying though: ```{r NND 1000D weight by degree} g1000d_nnd_w <- nnd_knn(g1000d, k = 15, metric = "euclidean", weight_by_degree = TRUE ) neighbor_overlap(g1000d_nnbf, g1000d_nnd_w) ``` Like a 2%ish improvement. It's not nothing. And despite the extra computation and $O\left(kN\right)$ storage required to calculate the k-occurrences in each iteration, with high-dimensional data the distance calculations tend to dominate the run time, so there's no reason to *not* set `weight_by_degree = TRUE` in this case. However, for the rest of the vignette, we'll keep it set to `FALSE` (the default) so it's easier to evaluate the contribution of other changes. ### Use More Neighbors One simple (slightly expensive) way is to keep more neighbors in the calculation. For example, double the number of neighbors to `30`, then get the top-15 accuracy: ```{r NND 1000D truncated 30} g1000d_nnd_k30 <- nnd_knn(g1000d, k = 30, metric = "euclidean") neighbor_overlap(g1000d_nnbf, g1000d_nnd_k30, k = 15) ``` That's a big improvement, but increasing `k` in this way can be quite expensive in terms of run time. ### Use More Candidates Another way to improve accuracy is to increase the number of candidates that are considered at each iteration. This is controlled by the `max_candidates` parameter. Let's try increasing that to `30`: ```{r NND 1000D max_candidates 30} g1000d_nnd_mc30 <- nnd_knn(g1000d, k = 15, metric = "euclidean", max_candidates = 30) neighbor_overlap(g1000d_nnbf, g1000d_nnd_mc30) ``` Not as good as increasing `k`, but also not as time-consuming and still an improvement over the default setting (which is set to be the same as `k`). This should probably be your first choice for improving accuracy. ### Decrease the convergence tolerance You could set `delta` to lower than the default of `0.001` or even to `0` to force the algorithm to run until convergence: ```{r NND 1000D tol 0} g1000d_nnd_tol0 <- nnd_knn(g1000d, k = 15, metric = "euclidean", delta = 0) neighbor_overlap(g1000d_nnbf, g1000d_nnd_tol0) ``` But as you can see, it's unlikely to achieve very much. The default parameters do a pretty good job of balancing accuracy and speed. ### Merging Multiple Independent Results What about taking advantage of the stochastic nature of the algorithm? If the results are sufficiently diverse between runs of NND, then we could generate two graphs from two separate runs, and then merge the results. Let's repeat NND and see what the accuracy of this new result is like. ```{r NND 1000D repeat} g1000d_nnd_rep <- nnd_knn(g1000d, k = 15, metric = "euclidean") neighbor_overlap(g1000d_nnbf, g1000d_nnd_rep, k = 15) ``` That's similar to the first run. That's re-assuring in the sense that the variance of the accuracy doesn't seem to be that high between one run to the next. But hopefully that doesn't also mean that NND is producing a very similar neighbor graph each time, in which case merging them won't be very helpful. Time to find out: ```{r merge} g1000d_nnd_merge <- merge_knn(list(g1000d_nnd, g1000d_nnd_rep)) neighbor_overlap(g1000d_nnbf, g1000d_nnd_merge, k = 15) ``` That's a big improvement. So it does seem like there is some diversity in the results. ```{r NND 100D compare accuracy} g1000d_nnd_rep_acc <- neighbor_overlap(g1000d_nnbf, g1000d_nnd_rep, k = 15, ret_vec = TRUE)$overlaps plot( g1000d_nnd_acc, g1000d_nnd_rep_acc, main = "1000D NND accuracy comparison", xlab = "accuracy run 1", ylab = "accuracy run 2" ) cor(g1000d_nnd_acc, g1000d_nnd_rep_acc) ``` Despite the similar overall accuracies, there's quite a large variance between runs in terms of which items have accurate neighborhoods. So there might be some scope for improving the results by merging different runs, especially if you can run the individual NND routines in parallel. ### Using a Search Graph Practically, the simplest way to improve results with `rnndescent` is to convert the neighbor graph into a search graph, and then query it with the original data. First, the preparation step: ```{r prepare graph} g1000d_search_graph <- prepare_search_graph( data = g1000d, graph = g1000d_nnd, metric = "euclidean", diversify_prob = 1, pruning_degree_multiplier = 1.5 ) ``` This augments the neighbor graph with the reversed edges of the neighbor graph, so that if $i$ is one of the nearest neighbors of $j$, we guarantee that $j$ is also considered a near neighbor $i$. This ameliorates the issue of anti-hubs because all $k$ neighbors of an anti-hub now have it in their neighbor list. The downside of including all reversed edges in the neighbor graph is that the neighbor list of a hub is now going to be very large as it consists of the $k$ nearest neighbors of the hub and then all the items that consider the hub a near neighbor, which by definition is a lot. This can make the search graph inefficient, as a disproportionate amount of time will be spent searching neighbors of the hub. The `diversify_prob` and `pruning_degree_multiplier` parameters are used to reduce back down the out-degree of each node (the number of out-going edges). This results in objects with a varying number of neighbors, in this case to a maximum of 22. This is about 50% larger than `k = 15` to account for the introduction of the reverse edges. Anti-hubs can be reintroduced due to the edge reduction, but hopefully the distribution of edges is a bit more equitable. Here is a summary and histogram of the k-occurrences of the search graph: ```{r histogram of k-occurrences of search graph} g1000d_sgko <- k_occur(g1000d_search_graph) hist(g1000d_sgko, main = "search graph k-occurrences", xlab = "k-occurrences") summary(g1000d_sgko) sum(g1000d_sgko == 1) ``` This is not *quite* as skewed as the neighbor graph, but there is still a lot of room for improvement. At any rate, with the search graph in hand, we can now search it using our original data as a query: ```{r search with prepared graph} g1000d_search <- graph_knn_query( query = g1000d, reference = g1000d, reference_graph = g1000d_search_graph, k = 15, metric = "euclidean", init = g1000d_nnd, epsilon = 0.1 ) ``` Are the results improved? ```{r accuracy with search graph} neighbor_overlap(g1000d_nnbf, g1000d_search, k = 15) ``` Yes, the accuracy is now nearly perfect. The disadvantages of the search graph approach for building a neighbor graph is that it is less efficient than NND: `graph_knn_query` must assume that the `query` data is entirely different to the `reference` data. The advantage is that we can make use of reverse edges and, more importantly, back-tracking (controlled via the `epsilon` parameter), which seems to make the difference in this example. The procedure above is the recommended practice of using `graph_knn_query` with a search graph generated from the neighbor graph. You are not required to use a search graph as the argument to the `reference_graph` parameter. Here is the back-tracking search using the neighbor graph directly and everything else the same: ```{r search with neighbor graph} g1000d_nnd_search <- graph_knn_query( query = g1000d, reference = g1000d, reference_graph = g1000d_nnd, k = 15, metric = "euclidean", init = g1000d_nnd, epsilon = 0.1 ) neighbor_overlap(g1000d_nnbf, g1000d_nnd_search, k = 15) ``` Accuracies are nearly as good. You can save even more time by turning off back-tracking (`epsilon = 0`): ```{r search with neighbor graph, no back-tracking} g1000d_nnd_search0 <- graph_knn_query( query = g1000d, reference = g1000d, reference_graph = g1000d_nnd, k = 15, metric = "euclidean", init = g1000d_nnd, epsilon = 0 ) neighbor_overlap(g1000d_nnbf, g1000d_nnd_search0, k = 15) ``` but accuracies are now noticeably less improved. Using the search graph without back-tracking gives slightly better accuracies: ```{r search with search graph, no back-tracking} g1000d_search0 <- graph_knn_query( query = g1000d, reference = g1000d, reference_graph = g1000d_search_graph, k = 15, metric = "euclidean", init = g1000d_nnd, epsilon = 0 ) neighbor_overlap(g1000d_nnbf, g1000d_search0, k = 15) ``` but it seems like some sort of back-tracking is to be recommended with this approach. The downside is that you could be searching through a large proportion of the dataset, in which case you would save time by using the `brute_force_query`. In conjunction with `epsilon`, you can also use `max_search_fraction` which will terminate the search if the fraction of the dataset searched exceeds the specified value. Set it to a value between 0 and 1, e.g. `max_search_fraction = 0.1` if you don't want more than 10% of the reference data being searched. The default is `1`, so that `epsilon` entirely controls the search termination. A similar parameter is used in [@harwood2016fanng]. ```{r search at most 10% of the data} g1000d_nnd_search_max <- graph_knn_query( query = g1000d, reference = g1000d, reference_graph = g1000d_nnd, k = 15, metric = "euclidean", init = g1000d_nnd, epsilon = 0.1, max_search_fraction = 0.1 ) neighbor_overlap(g1000d_nnbf, g1000d_nnd_search_max, k = 15) ``` Accuracies are again reduced. ## Conclusions * High dimensional data leads to hubs. * The "hubness" of an item in a dataset can be measured by the k-occurrence in the corresponding nearest neighbor graph. The higher the k-occurrence, the more of a hub it is. * The existence of a hubs implies the existence of "anti-hubs", i.e. items with a low k-occurrence. A small number of hubs can create a disproportionately larger number of anti-hubs, with a larger value of the k-occurrence creating more anti-hubs. * The accuracy of nearest neighbor descent is reduced by the presence of hubs: specifically, the lower the k-occurrence of an item, the greater the probability of a low accuracy of its nearest neighbors. * Accuracy of nearest neighbor descent can be improved by searching for a larger number of neighbors and then truncating the result to the desired size, at the cost of a longer run-time and memory usage. * Alternatively, you can run the the nearest neighbor descent multiple times from different random starting points and merge the results. * Setting `weight_by_degree = TRUE` is probably worth doing, albeit it gives a very modest improvement. * More accurate and efficient results are obtained by converting the nearest neighbor descent results into a search graph and then querying the graph with the original data, using the nearest neighbor results for initialization and a back-tracking search. If you are concerned with potential hubs interfering with the accuracy of the neighbor graph, I suggest the following steps: 1. Generate a neighbor graph with `nnd_knn` and default parameters, or even for just 1 iteration. 2. Evaluate the hubness of the graph with `k_occur`. 3. If the maximum k-occurrence exceeds a threshold (maybe `10 * k` is a good starting point), then you could try the following: 4. Restart `nnd_knn` (you as may as well use the output from the first step as the initialization) with `max_candidates` set to a larger value. You probably don't want to set it larger than `60`. 5. Repeat step 4 to get a second graph. 6. Merge the graphs from steps 4 and 5 with `merge_knn`. 7. Use the merged graph for `prepare_search_graph`, and then run `graph_knn_query` with back-tracking search (set `epsilon > 0`) to refine the results further. 8 If in step 3, the maximum k-occurrence is *less* than the threshold, then you are probably ok to run `nnd_knn` with defaults. This should provide a robust approach to producing accurate approximate nearest neighbors without spending time on unnecessary graph search when the results are probably already quite good. For more on the effect of hubness and nearest neighbors, and more advanced attempts to fix the problem, see the work of Flexer and co-workers [@schnitzer2012local; @flexer2016empirical; @feldbauer2019comprehensive; @feldbauer2019scikit] and Radovanović and co-workers [@radovanovic2010hubs; @bratic2019influence]. ## References