Setting num_threads directly in umap2 did not result in the number of SGD
threads being updated to that value when batch = TRUE, which it should have
been.
umap_transform continued to return the fuzzy graph in transposed form. Thank
you PedroMilanezAlmeida for reopening the issue
(https://github.com/jlmelville/uwot/issues/118).
RSpectra is now a required dependency (again). It was a required dependency
up until version 0.1.12, when it became optional (irlba was used in its
place). However, problems with interactions of the current version of irlba
with an ABI change in the Matrix package mean that it's hard for downstream
packages and users to build uwot without re-installing Matrix and irlba from
source, which may not be an option for some people. Also it was causing a
CRAN check error. I have changed some tests, examples and vignettes to use
RSpectra explicitly, and to only test irlba code-paths where necessary. See
https://github.com/jlmelville/uwot/issues/115 and links therein for more
details.
Set nn_method = "hnsw"
to use it. The behavior of the method can be controlled by the new nn_args
parameter, a list which may contain M, ef_construction and ef. See the
hnswlib library's ALGO_PARAMS documentation for details on these parameters.
Although HNSW is typically faster than Annoy (for a given accuracy), be aware
that the only supported metric values are "euclidean", "cosine" and
"correlation". Finally, RcppHNSW is only a suggested package, not a
requirement, so you need to install it yourself (e.g. via
install.packages("RcppHNSW")). Also see the article on HNSW in uwot in the
documentation.
Set nn_method = "nndescent"
to use it. The behavior of the method can be controlled by the new nn_args
parameter. There are many supported metrics and possible parameters that can
be set in nn_args, so please see the article on nearest neighbor descent in
uwot in the documentation, and also the rnndescent package's documentation
for details. rnndescent is only a suggested package, not a requirement, so
you need to install it yourself (e.g. via install.packages("rnndescent")).
umap2, which acts like umap but with modified defaults, reflecting my
experience with UMAP and correcting some small mistakes. See the umap2
article for more details.
init_sdev = "range"
caused an error with a user-supplied init matrix.
The correlation metric was actually using the cosine metric if you saved and
reloaded the model. Thank you Holly Hall for the report and helpful detective
work (https://github.com/jlmelville/uwot/issues/117).
umap_transform could fail if the new data to be transformed had the
scaled:center and scaled:scale attributes set (e.g. from applying the scale
function).
If you asked umap_transform to return the fuzzy graph
(ret_extra = c("fgraph")), it was transposed when batch = TRUE, n_epochs = 0.
Thank you PedroMilanezAlmeida for reporting
(https://github.com/jlmelville/uwot/issues/118).
Setting n_sgd_threads = "auto" with umap_transform caused a crash.
dist class was meant that may have been particularly affecting Seurat users.
Thank you AndiMunteanu for reporting (and suggesting a solution)
(https://github.com/jlmelville/uwot/issues/121).
optimize_graph_layout. Use this to produce optimized output
coordinates that reflect an input similarity graph (such as that produced by
the similarity_graph function). similarity_graph followed by
optimize_graph_layout is the same as running umap, so the purpose of these
functions is to allow for more flexibility and decoupling between generating
the nearest neighbor graph and optimizing the low-dimensional approximation
to it. Based on a request by user Chengwei94
(https://github.com/jlmelville/uwot/issues/98).
simplicial_set_union and simplicial_set_intersect. These allow for the
combination of different fuzzy graph representations of a dataset into a
single fuzzy graph using the UMAP simplicial set operations. Based on a
request in the Python UMAP issues tracker by user Dhar xion.
umap_transform: ret_extra. This works like the
equivalent parameter for umap, and should be a character vector specifying
the extra information you would like returned in addition to the embedding,
in which case a list will be returned with an embedding member containing
the optimized coordinates. Supported values are "fgraph", "nn", "sigma" and
"localr". Based on a request by user PedroMilanezAlmeida
(https://github.com/jlmelville/uwot/issues/104).
umap, tumap and umap_transform: seed. This will do the equivalent of calling
set.seed internally, and hence will help with reproducibility. The chosen
seed is exported if ret_model = TRUE and umap_transform will use that seed
if present, so you only need to specify it in umap_transform if you want to
change the seed. The default behavior remains to not modify the random
number state. Based on a request by SuhasSrinivasan
(https://github.com/jlmelville/uwot/issues/110).
init_sdev: set init_sdev = "range"
and initial coordinates will be range-scaled so each column takes values
between 0 and 10. This pre-processing was added to the Python UMAP package at
some point after uwot began development and so should probably always be
used with the default init = "spectral" setting. However, it is not set by
default to maintain backwards compatibility with older versions of uwot.
ret_extra = c("sigma") is now supported by lvish. The Gaussian bandwidths
are returned in a sigma
vector. In addition, a vector of intrinsic dimensionalities estimated for
each point using an analytical expression of the finite difference method
given by Lee and co-workers is returned in the dint vector.
The min_dist and spread parameters are now returned in the model when umap
is run with ret_model = TRUE. This is just for documentation purposes: these
values are not used directly by the model in umap_transform. If the
parameters a and b are set directly when invoking umap, then both min_dist
and spread will be set to NULL in the returned model. This feature was added
in response to a question from kjiang18
(https://github.com/jlmelville/uwot/issues/95).
n_components seems to have been set too high.
If n_components was greater than n_neighbors then umap_transform would crash
the R session. Thank you to ChVav for reporting this
(https://github.com/jlmelville/uwot/issues/102).
umap_transform
with a model where dens_scale was set could cause a segmentation fault,
destroying the session. Even if it didn't, it could give an entirely
artifactual "ring" structure. Thank you FemkeSmit for reporting this and
providing assistance in diagnosing the underlying cause
(https://github.com/jlmelville/uwot/issues/103).
If binary_edge_weights = TRUE, this setting was not exported when
ret_model = TRUE, and was therefore not respected by umap_transform. This
has now been fixed, but you will need to regenerate any models that used
binary edge weights.
The documentation for the init param said that if there were multiple
disconnected components, a spectral initialization would attempt to merge
multiple sub-graphs. Not true: actually, spectral initialization is abandoned
in favor of PCA. The documentation has been updated to reflect the true state
of affairs. No idea what I was thinking of there.
load_model and save_model didn't work on Windows 7 due to how the version of
tar there handles drive letters. Thank you mytarmail for the report
(https://github.com/jlmelville/uwot/issues/109).
10.0), because this can lead to small gradients and poor optimization. Thank
you SuhasSrinivasan for the report
(https://github.com/jlmelville/uwot/issues/110).
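As a rough illustration of the init_sdev = "range" option described in these
notes, the following base R sketch range-scales each column of a coordinate
matrix to lie between 0 and 10. The helper range_scale_columns is hypothetical
and not part of uwot, and the handling of constant columns is an assumption;
uwot's own implementation may differ.

```r
# Hypothetical helper (not part of uwot): rescale every column of a numeric
# matrix so that its values span [0, max_coord].
range_scale_columns <- function(coords, max_coord = 10) {
  apply(coords, 2, function(column) {
    column_range <- range(column)
    width <- column_range[2] - column_range[1]
    if (width == 0) {
      # Assumption: a constant column cannot be range-scaled, so map it to 0.
      return(rep(0, length(column)))
    }
    max_coord * (column - column_range[1]) / width
  })
}

set.seed(42)
# Stand-in for a tightly clustered initialization (very small spread).
init_coords <- matrix(rnorm(200, sd = 1e-4), ncol = 2)
scaled <- range_scale_columns(init_coords)

# Each column of the result now runs from 0 to max_coord.
apply(scaled, 2, range)
```

The point of the sketch is only that the scale of the initial coordinates is
fixed regardless of how small or large the raw initialization was.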
similarity_graph. If you are more interested in the high-dimensional
graph/fuzzy simplicial set representation of your input data, and don't care
about the low dimensional approximation, the similarity_graph function offers
a similar API to umap, but neither the initialization nor optimization of
low-dimensional coordinates will be performed. The return value is the same
as that which would be returned in the results list as the fgraph member if
you had provided ret_extra = c("fgraph"). Compared to getting the same result
via running umap, this function is a bit more convenient to use, makes your
intention clearer if you would be discarding the embedding, and saves a small
amount of time. A t-SNE/LargeVis similarity graph can be returned by setting
method = "largevis".
umap_transform with pre-generated nearest neighbors (also the error message
was completely useless). Thank you to AustinHartman for reporting this
(https://github.com/jlmelville/uwot/issues/97).
fuzzy_simplicial_set) refactored to behave more like that of previous
versions. This change was breaking the behavior of the CRAN package bbknnR.
dens_weight. If set to a value between 0 and 1, an attempt
is made to include the relative local densities of the input data in the
output coordinates. This is an approximation to the densMAP method. A large
value of dens_weight will use a larger range of output densities to reflect
the input data. If the data is too spread out, reduce the value of
dens_weight. For more information see the documentation at the uwot repo.
binary_edge_weights. If set to TRUE, instead of smoothed knn distances,
non-zero edge weights all have a value of 1. This is how PaCMAP works and
there are practical and theoretical reasons to believe this won't have a big
effect on UMAP but you can try it yourself.
ret_extra:
"sigma": the return value will contain a sigma entry, a vector of the smooth
knn distance scaling normalization factors, one for each observation in the
input data. A small value indicates a high density of points in the local
neighborhood of that observation. For lvish the equivalent bandwidths
calculated for the input perplexity are returned.
rho
will be exported, which is the distance to the nearest neighbor after the
number of neighbors specified by the local_connectivity. Only applies for
umap and tumap.
"localr": exports a vector of the local radii, which are the sum of sigma and
rho and are used to scale the output coordinates when dens_weight is set.
Even if not using dens_weight, visualizing the output coordinates using a
color scale based on the value of localr can reveal regions of the input data
with different densities.
umap
and tumap only: new data type for precomputed nearest neighbor data passed as
the nn_method parameter: you may use a sparse distance matrix of format
dgCMatrix with dimensions N x N where N is the number of observations in the
input data. Distances should be arranged by column, i.e. a non-zero entry in
row j of the ith column indicates that the jth observation in the input data
is a nearest neighbor of the ith observation with the distance given by the
value of that element. Note that this is a different format to the sparse
distance matrix that can be passed as input to X: notably, the matrix is not
assumed to be symmetric. Unlike other input formats, you may have a different
number of neighbors for each observation (but there must be at least one
neighbor defined per observation).
umap_transform
can also take a sparse distance matrix as its nn_method parameter if
precomputed nearest neighbor data is used to generate an initial model. The
format is the same as for the nn_method with umap. Because distances are
arranged by columns, the expected dimensions of the sparse matrix are
N_model x N_new where N_model is the number of observations in the original
data and N_new is the number of observations in the data to be transformed.
n_components = 100 or higher), RSpectra is recommended and will likely
out-perform irlba even if you have installed a good linear algebra library.
init = "laplacian"
returned the wrong coordinates because of a slightly subtle issue around how
to order the eigenvectors when using the random walk transition matrix rather
than normalized graph laplacians.
The init_sdev parameter was ignored when the init parameter was a
user-supplied matrix. Now the input will be scaled.
The bandwidth parameter has been changed to give results more like the
current version (0.5.2) of the Python UMAP implementation. This is likely to
be a breaking change for non-default settings of bandwidth, but this is not a
parameter which is actually exposed by the Python UMAP public API any more,
so it is on the road to deprecation in uwot too and I don't recommend you
change this.
batch. If TRUE, then results are reproducible when
n_sgd_threads > 1 (as long as you use set.seed). The price to be paid is that
the optimization is slightly less efficient (because coordinates are not
updated as quickly and hence gradients are staler for longer), so it is
highly recommended to set n_epochs = 500 or higher. Thank you to Aaron Lun
who not only came up with a way to implement this feature, but also wrote an
entire C++ implementation of UMAP which does it
(https://github.com/jlmelville/uwot/issues/83).
opt_args. The default optimization method when batch = TRUE is Adam. You can
control its parameters by passing them in the opt_args list. As Adam is a
momentum-based method it requires extra storage of previous gradient data. To
avoid the extra memory overhead you can also use
opt_args = list(method = "sgd") to use a stochastic gradient descent method
like that used when batch = FALSE.
epoch_callback. You may now pass a function which will be invoked at the end
of each epoch. Mainly useful for producing an image of the state of the
embedding at different points during the optimization. This is another
feature taken from umappp.
pca_method, used when the pca
parameter is supplied to reduce the initial dimensionality of the data. This
controls which method is used to carry out the PCA and can be set to one of:
"irlba", which uses irlba::irlba to calculate a truncated SVD. If this
routine deems that you are trying to extract 50% or more of the singular
vectors, you will see a warning to that effect logged to the console.
"rsvd", which uses irlba::svdr for truncated SVD. This method uses a small
number of iterations which should give an accuracy/speed up trade-off similar
to that of the scikit-learn TruncatedSVD method. This can be much faster than
using "irlba" but potentially at a cost in accuracy. However, for the
purposes of dimensionality reduction as input to nearest neighbor search,
this doesn't seem to matter much.
"bigstatsr", which uses the bigstatsr package. Note that this is not a
dependency of uwot. If you want to use bigstatsr, you must install it
yourself. On platforms without easy access to fast linear algebra libraries
(e.g. Windows), using bigstatsr may give a speed up to PCA calculations.
"svd", which uses base::svd. Warning: this is likely to be very slow for most
datasets and exists as a fallback for small datasets where the "irlba" method
would print a warning.
"auto" (the default), which uses "irlba" to calculate a truncated SVD, unless
you are attempting to extract 50% or more of the singular vectors, in which
case "svd" is used.
ret_nn = TRUE. If the names exist in more
than one of the input data parameters listed above, but are inconsistent, no
guarantees are made about which names will be used. Thank you jwijffels for
reporting this.
In umap_transform, the learning rate is now down-scaled by a factor of 4,
consistent with the Python implementation of UMAP. If you need the old
behavior back, use the (newly added) learning_rate parameter in
umap_transform to set it explicitly. If you used the default value in umap
when creating the model, the correct setting in umap_transform is
learning_rate = 1.0.
Setting nn_method = "annoy" and verbose = TRUE would lead to an error with
datasets with fewer than 50 items in them.
umap_transform (this was incorrectly documented to work).
umap_transform was wrong in other ways: it has now been corrected to indicate
that there should be neighbor data for each item in the test data, but the
neighbors and distances should refer to items in training data (i.e. the data
used to build the model).
The n_neighbors parameter is now correctly ignored in model generation if
pre-calculated nearest neighbor data is provided.
grain_size didn't do anything.
This release is mainly to allow for some internal changes to keep
compatibility with RcppAnnoy, used for the nearest neighbor calculations.
umap and tumap now note that the contents of the model list are subject to
change and not intended to be part of the uwot public API. I recommend not
relying on the structure of the model, especially if your package is intended
to appear on CRAN or Bioconductor, as any breakages will delay future
releases of uwot to CRAN.
metric = "correlation": a distance based on the Pearson correlation
(https://github.com/jlmelville/uwot/issues/22). Supporting this required a
change to the internals of how nearest neighbor data is stored. Backwards
compatibility with models generated by previous versions using
ret_model = TRUE should have been preserved.
nn_method, for umap_transform: pass a list containing pre-computed nearest
neighbor data (identical to that used in the umap function). You should not
pass anything to the X parameter in this case. This extends the functionality
for transforming new points to the case where nearest neighbor data between
the original data and new data can be calculated external to uwot. Thanks to
Yuhan Hao for contributing the PR
(https://github.com/jlmelville/uwot/issues/63 and
https://github.com/jlmelville/uwot/issues/64).
init, for umap_transform: provides a variety of options for
initializing the output coordinates, analogously to the same parameter in the
umap function (but without as many options currently). This is intended to
replace init_weighted, which should be considered deprecated, but won't be
removed until uwot 1.0 (whenever that is). Instead of init_weighted = TRUE,
use init = "weighted"; replace init_weighted = FALSE with init = "average".
Additionally, you can pass a matrix to init to act as the initial
coordinates.
umap_transform: previously, setting n_epochs = 0 was ignored: at least one
iteration of optimization was applied. Now, n_epochs = 0 is respected, and
will return the initialized coordinates without any further optimization.
verbose = TRUE: the progress bar calculations were taking up a detectable
amount of time; this has now been fixed. With very small data sets (< 50
items) the progress bar will no longer appear when building the index.
n_threads
is now NULL to provide a bit more protection from changing dependencies.
The grain_size parameter has been undeprecated. As the version that
deprecated this never made it to CRAN, this is unlikely to have affected many
people.
The grain_size parameter is now ignored and remains to avoid breaking
backwards compatibility only.
ret_extra, a vector which can contain any combination of: "model" (same as
ret_model = TRUE), "nn" (same as ret_nn = TRUE) and "fgraph" (see below).
ret_extra
vector contains "fgraph", the returned list will contain an fgraph item
representing the fuzzy simplicial input graph as a sparse N x N matrix. For
lvish, use "P" instead of "fgraph"
(https://github.com/jlmelville/uwot/issues/47). Note that there is a further
sparsifying step where edges with a very low membership are removed if there
is no prospect of the edge being sampled during optimization. This is
controlled by n_epochs: the smaller the value, the more sparsifying will
occur. If you are only interested in the fuzzy graph and not the embedded
coordinates, set n_epochs = 0.
unload_uwot, to unload the Annoy nearest neighbor indices in a model. This
prevents the model from being used in umap_transform, but allows for the
temporary working directory created by both save_uwot and load_uwot to be
deleted. Previously, both load_uwot and save_uwot were attempting to delete
the temporary working directories they used, but would always silently fail
because Annoy is making use of files in those directories.
init = "spca", fixed values of a
and b (rather than allowing them to be calculated through setting min_dist
and spread) and approx_pow = TRUE. Using the tumap method with init = "spca"
is probably the most robust approach.
n_epochs = 0. This used to behave like n_epochs = NULL and gave a default
number of epochs (dependent on the number of vertices in the dataset). Now it
more usefully carries out all calculations except optimization, so the
returned coordinates are those specified by the init parameter. This is an
easy way to access e.g. the spectral or PCA initialization coordinates. If
you want the input fuzzy graph (ret_extra vector contains "fgraph"), this
will also prevent the graph having edges with very low membership being
removed. You still get the old default epochs behavior by setting
n_epochs = NULL or to a negative value.
save_uwot and load_uwot have been updated with a verbose parameter so it's
easier to see what temporary files are being created.
save_uwot
has a new parameter, unload, which if set to TRUE will delete the working
directory for you, at the cost of unloading the model, i.e. it can't be used
with umap_transform until you reload it with load_uwot.
save_uwot now returns the saved model with an extra field, mod_dir, which
points to the location of the temporary working directory, so you should now
assign the result of calling save_uwot to the model you saved, e.g.
model <- save_uwot(model, "my_model_file"). This field is intended for use
with unload_uwot.
load_uwot also returns the model with a mod_dir item for use with
unload_uwot.
save_uwot and load_uwot were not correctly handling relative paths.
load_uwot
in uwot 0.1.4 to work with newer versions of RcppAnnoy
(https://github.com/jlmelville/uwot/issues/31) failed in the typical case of
a single metric for the nearest neighbor search using all available columns,
giving an error message along the lines of:
Error: index size <size> is not a multiple of vector size <size>. This has
now been fixed, but required changes to both save_uwot and load_uwot, so
existing saved models must be regenerated. Thank you to reporter OuNao.
n_threads caused a crash. This was particularly insidious if running with a
system with only one default thread available as the default n_threads
becomes 0.5. Now n_threads (and n_sgd_threads) are rounded to the nearest
integer.
ERROR: there is already an InterruptableProgressMonitor instance defined.
If verbose = TRUE, the a, b curve parameters are now logged.
Even with a fix for the bug mentioned above, if the nearest neighbor index
file is larger than 2GB in size, Annoy may not be able to read the data back
in. This should only occur with very large or high-dimensional datasets. The
nearest neighbor search will fail under these conditions. A work-around is to
set n_threads = 0, because the index will not be written to disk and
re-loaded under these circumstances, at the cost of a longer search time.
Alternatively, set the pca parameter to reduce the dimensionality or lower
n_trees, both of which will reduce the size of the index on disk. However,
either may lower the accuracy of the nearest neighbor results.
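The two work-arounds above might look something like the following sketch. It
assumes uwot is installed and uses the small built-in iris data purely as a
stand-in (it is nowhere near large enough to trigger the 2GB problem); the
values pca = 2 and n_trees = 10 are arbitrary illustrations, not
recommendations.

```r
# Sketch only: runs if the uwot package is available.
if (requireNamespace("uwot", quietly = TRUE)) {
  X <- as.matrix(iris[, 1:4])  # small stand-in dataset

  # Work-around 1: n_threads = 0, so the Annoy index is not written to disk
  # and re-loaded (at the cost of a longer search on large data).
  set.seed(42)
  emb_no_disk <- uwot::umap(X, nn_method = "annoy", n_threads = 0)

  # Work-around 2: shrink the index by reducing dimensionality with PCA
  # and/or building fewer trees; either may lower neighbor accuracy.
  set.seed(42)
  emb_small_index <- uwot::umap(X, nn_method = "annoy", pca = 2, n_trees = 10)
}
```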
Initial CRAN release.
tmpdir, which allows the user to specify the temporary directory where
nearest neighbor indexes will be written during Annoy nearest neighbor
search. The default is base::tempdir(). Only used if n_threads > 1 and
nn_method = "annoy".
Fixed an issue with lvish where there was an off-by-one error when
calculating input probabilities.
Added a safe-guard to lvish to prevent the Gaussian precision, beta, becoming
overly large when the binary search fails during perplexity calibration.
The lvish perplexity calibration uses the log-sum-exp trick to avoid numeric
underflow if beta becomes large.
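For readers unfamiliar with the log-sum-exp trick, here is a minimal base R
sketch of the general technique. It is not the lvish source code: the function
names log_sum_exp and kernel_entropy, and the example distances, are
illustrative assumptions. The idea is to subtract the largest exponent before
exponentiating, so at least one term is exactly 1 and the sum cannot underflow
to zero.

```r
# Numerically stable log(sum(exp(x))): shift by the maximum before
# exponentiating, then add the shift back.
log_sum_exp <- function(x) {
  x_max <- max(x)
  x_max + log(sum(exp(x - x_max)))
}

# Shannon entropy (in nats) of Gaussian-kernel probabilities for one point,
# given squared distances to its neighbors and a precision beta.
kernel_entropy <- function(sq_dists, beta) {
  log_w <- -beta * sq_dists         # log of the unnormalized weights
  log_z <- log_sum_exp(log_w)       # log of the normalization constant
  log_p <- log_w - log_z
  p <- exp(log_p)
  keep <- p > 0                     # skip terms that underflowed to zero
  -sum(p[keep] * log_p[keep])
}

sq_dists <- c(1.0, 1.5, 2.0, 4.0)
beta <- 1000                        # large precision

# Naive normalization underflows: every exp(-beta * d) is zero.
naive_z <- sum(exp(-beta * sq_dists))

# The shifted version stays finite.
stable_log_z <- log_sum_exp(-beta * sq_dists)
entropy <- kernel_entropy(sq_dists, beta)
```

With the naive sum, taking log(naive_z) would give -Inf and the perplexity
search would break down; the shifted form keeps the calculation finite.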
pcg_rand. If TRUE (the default), then a random number generator from the PCG
family is used during the stochastic optimization phase. The old PRNG, a
direct translation of an implementation of the Tausworthe "taus88" PRNG used
in the Python version of UMAP, can be obtained by setting pcg_rand = FALSE.
The new PRNG is slower, but is likely superior in its statistical randomness.
This change in behavior will break backwards compatibility: you will now get
slightly different results even with the same seed.
fast_sgd. If TRUE, then the following combination of parameters is set:
n_sgd_threads = "auto", pcg_rand = FALSE and approx_pow = TRUE. These will
result in a substantially faster optimization phase, at the cost of being
slightly less accurate and results not being exactly repeatable.
fast_sgd = FALSE by default but if you are only interested in visualization,
then fast_sgd gives perfectly good results. For more generic dimensionality
reduction and reproducibility, keep fast_sgd = FALSE.
init_sdev
which specifies how large the standard deviation of each column of the
initial coordinates should be. This will scale any input coordinates
(including user-provided matrix coordinates). init = "spca" can now be
thought of as an alias of init = "pca", init_sdev = 1e-4. This may be too
aggressive scaling for some datasets. The typical UMAP spectral
initializations tend to result in standard deviations of around 2 to 5, so
this might be more appropriate in some cases. If spectral initialization
detects multiple components in the affinity graph and falls back to scaled
PCA, it uses init_sdev = 1.
With init_sdev, the init options sspectral, slaplacian and snormlaplacian
have been removed (they weren't around for very long anyway). You can get the
same behavior by e.g. init = "spectral", init_sdev = 1e-4. init = "spca" is
sticking around because I use it a lot.
init = "spca".
<random> header. This breaks backwards compatibility even if you set
pcg_rand = FALSE.
metric = "cosine" results were incorrectly using the unmodified Annoy angular
distance.
categorical metric (fixes https://github.com/jlmelville/uwot/issues/20).
n_components (e.g. approximately 50% faster optimization time with MNIST and
n_components = 50).
pca_center, which controls whether to center the data before
applying PCA. It would be typical to set this to FALSE if you are applying
PCA to binary data (although note you can't use this setting with
metric = "hamming").
metric is "manhattan" and "cosine". It's still not applied when using
"hamming" (data still needs to be in binary format, not real-valued).
pca and pca_center parameter values for a given data block by using a list
for the value of the metric, with the column ids/names as an unnamed item and
the overriding values as named items, e.g. instead of manhattan = 1:100, use
manhattan = list(1:100, pca_center = FALSE) to turn off PCA centering for
just that block. This functionality exists mainly for the case where you have
mixed binary and real-valued data and want to apply PCA to both data types.
It's normal to apply centering to real-valued data but not to binary data.
umap_transform, where negative sampling was over the size of the test data
(should be the training data).
If verbose = TRUE, log the Annoy recall accuracy, which may help tune values
of n_trees and search_k.
n_sgd_threads, which controls the number of threads used
in the stochastic gradient descent. By default this is now single-threaded
and should give reproducible results when using set.seed. To get back the
old, less consistent, but faster settings, set n_sgd_threads = "auto".
alpha is now learning_rate.
gamma is now repulsion_strength.
laplacian and normlaplacian).
init options: sspectral, snormlaplacian and slaplacian. These are like
spectral, normlaplacian, laplacian respectively, but scaled so that each
dimension has a standard deviation of 1e-4. This is like the difference
between the pca and spca options.
pca: set this to a positive integer to reduce a matrix or data frame to that
number of columns using PCA. Only works if metric = "euclidean". If you
have > 100 columns, this can substantially improve the speed of the nearest
neighbor search. t-SNE implementations often set this value to 50.
metric: instead of
specifying a single metric name (e.g. metric = "euclidean"), you can pass a
list, where the name of each item is the metric to use and the value is a
vector of the names of the columns to use with that metric, e.g.
metric = list("euclidean" = c("A1", "A2"), "cosine" = c("B1", "B2", "B3"))
treats columns A1 and A2 as one block, using the Euclidean distance to find
nearest neighbors, whereas B1, B2 and B3 are treated as a second block, using
the cosine distance.
categorical.
y may now be a data frame or matrix if multiple target data is available.
target_metric, to specify the distance metric to use with numerical y. This
has the same capabilities as metric.
scale = "Z" to Z-scale each column of input (synonym for scale = TRUE or
scale = "scale").
scale = "colrange" to scale columns in the range (0, 1).
y, you may pass nearest neighbor data directly, in the same format as that
supported by X-related nearest neighbor data. This may be useful if you don't
want to use Euclidean distances for the y data, or if you have missing data
(and have a way to assign nearest neighbors for those cases, obviously). See
the Nearest Neighbor Data Format section for details.
ret_nn: when TRUE
returns nearest neighbor matrices as a nn list: indices in item idx and
distances in item dist. Embedded coordinates are in embedding. Both ret_nn
and ret_model can be TRUE, and should not cause any compatibility issues with
supervised embeddings.
nn_method can now take precomputed nearest neighbor data. Must be a list of
two matrices: idx, containing integer indexes, and dist containing distances.
By no coincidence, this is the format returned by ret_nn.
n_components = 1 was broken (https://github.com/jlmelville/uwot/issues/6).
init parameter were being modified, in defiance of basic R pass-by-copy
semantics.
metric = "cosine" is working again for n_threads greater than 0
(https://github.com/jlmelville/uwot/issues/5).
August 5 2018. You can now use an existing embedding to add new points via
umap_transform. See the example section below.
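A minimal sketch of adding new points to an existing embedding, assuming uwot
is installed. The iris data and the 120/30 train/test split are illustrative
choices only; the key steps are building the model with ret_model = TRUE and
then passing it to umap_transform.

```r
# Sketch only: runs if the uwot package is available.
if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  X <- as.matrix(iris[, 1:4])

  # Hold out 30 rows to play the role of new, unseen data.
  test_rows <- sample(nrow(X), 30)
  train <- X[-test_rows, ]
  test <- X[test_rows, ]

  # Build the embedding and keep the model so it can be reused later.
  model <- uwot::umap(train, ret_model = TRUE)

  # Place the held-out points into the existing embedding.
  test_embedding <- uwot::umap_transform(test, model)
  dim(test_embedding)
}
```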
August 1 2018. Numerical vectors are now supported for supervised dimension reduction.
July 31 2018. (Very) initial support for supervised dimension reduction:
categorical data only at the moment. Pass in a factor vector (use NA for
unknown labels) as the y parameter and edges with bad (or unknown) labels are
down-weighted, hopefully leading to better separation of classes. This works
remarkably well for the Fashion MNIST dataset.
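As a small sketch of the supervised usage described above (assuming uwot is
installed), the example below passes a factor with some labels replaced by NA
as the y parameter. The iris data stands in for a real dataset such as Fashion
MNIST, and hiding 30 labels is an arbitrary choice for illustration.

```r
# Sketch only: runs if the uwot package is available.
if (requireNamespace("uwot", quietly = TRUE)) {
  set.seed(42)
  X <- as.matrix(iris[, 1:4])

  # Factor of class labels; mark 30 of them as unknown with NA.
  labels <- iris$Species
  labels[sample(length(labels), 30)] <- NA

  # Supervised embedding: the labels are passed via the y parameter.
  embedding <- uwot::umap(X, y = labels)
}
```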
July 22 2018. You can now use the cosine and Manhattan distances with the
Annoy nearest neighbor search, via metric = "cosine" and
metric = "manhattan", respectively. Hamming distance is not supported because
RcppAnnoy doesn't yet support it.