BiocNeighbors 1.20.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW algorithms used below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8367 8024 412 7452 4986 8059 8767 5863 5568 9170
## [2,] 7505 5084 4941 9248 8300 2767 2588 5815 2414 6779
## [3,] 1423 6033 4754 5243 1063 7402 5736 569 2958 2372
## [4,] 4099 6815 4468 4392 8794 2570 9754 4784 9030 4412
## [5,] 581 2481 8556 6514 4270 6759 2803 4132 5325 7112
## [6,] 8144 6697 3402 9300 5248 3157 7580 2971 1667 1060
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9171678 0.9425159 0.9431646 0.9572297 0.9644480 0.9805946 1.0174010
## [2,] 0.9621327 0.9670848 0.9929832 0.9993303 1.0228894 1.0396564 1.0438344
## [3,] 0.8206025 0.8798296 0.8804766 0.9020995 0.9048094 0.9160107 0.9232739
## [4,] 0.8161841 0.8854532 0.9174184 0.9401400 0.9851396 0.9930533 0.9965966
## [5,] 1.0427125 1.0645164 1.0669452 1.0683836 1.1139536 1.1449358 1.1469938
## [6,] 0.7417828 0.9002681 0.9140049 0.9155242 0.9189631 0.9382798 0.9427933
## [,8] [,9] [,10]
## [1,] 1.0249903 1.0295681 1.0335536
## [2,] 1.0509665 1.0897555 1.0998900
## [3,] 0.9412209 0.9614992 0.9652658
## [4,] 1.0189202 1.0367405 1.0378553
## [5,] 1.1491128 1.1525401 1.1539816
## [6,] 0.9929794 0.9986178 1.0044509
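Because the search is approximate, the reported neighbors may differ from the true nearest neighbors. The following is a minimal sketch of one way to gauge this, assuming that an exact search with KmknnParam() is affordable as a reference; the smaller nobs, the seed and the per-point recall summary are illustrative choices rather than part of the original example.

```r
library(BiocNeighbors)

set.seed(1000)  # arbitrary seed, only so that the simulated data are reproducible
nobs <- 2000    # smaller than above to keep the exact search quick
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Run an exact search (KMKNN) and an approximate search (Annoy) on the same data.
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())

# Per-point recall: the fraction of Annoy's neighbors that are true nearest neighbors.
recall <- vapply(seq_len(nobs), function(i) {
    mean(approx$index[i,] %in% exact$index[i,])
}, numeric(1))
summary(recall)
```

A recall close to 1 suggests that the approximation costs little accuracy for this dataset; lower values may call for tuning the algorithm parameters.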
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 6822 4426 1715 7508 2464
## [2,] 957 5985 2454 2068 3042
## [3,] 9306 7186 6908 9102 7949
## [4,] 866 6060 1349 6982 8261
## [5,] 8861 443 188 3160 1484
## [6,] 9055 4826 3262 5514 6679
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9289543 1.0253005 1.0686672 1.0935981 1.1488045
## [2,] 0.7525633 0.8865743 0.8985186 0.9768201 0.9948143
## [3,] 0.9024137 0.9536997 1.0628147 1.0737333 1.0961367
## [4,] 0.8370829 0.9533898 0.9977849 1.0355099 1.0538019
## [5,] 0.7983236 0.8354070 0.8403973 0.8904105 0.9825965
## [6,] 0.9309431 1.0334160 1.0844851 1.1001310 1.1003900
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
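As a minimal sketch of the HNSW equivalent of the calls above, the only change is the BNPARAM argument. The smaller simulated datasets and the seed are assumptions made to keep the example quick.

```r
library(BiocNeighbors)

set.seed(42)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(1000*20), ncol=20)
query <- matrix(runif(100*20), ncol=20)

# Same interface as for Annoy, with HnswParam() in place of AnnoyParam().
hout <- findKNN(data, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

dim(hout$index)   # one row per point in 'data', one column per neighbor
dim(hqout$index)  # one row per point in 'query', one column per neighbor
```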
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
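The other options listed above can be sketched in the same way. This example assumes a small simulated dataset and uses MulticoreParam() from BiocParallel with two workers as one possible parallelization backend; any other BiocParallel backend could be substituted.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(0)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(1000*20), ncol=20)

# subset: only report neighbors for the first five points.
sub <- findKNN(data, k=5, subset=1:5, BNPARAM=AnnoyParam())

# get.distance: skip the distance matrix when only the indices are needed.
idx.only <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())

# BPPARAM: split the search across two workers.
par.out <- findKNN(data, k=5, BPPARAM=MulticoreParam(2), BNPARAM=AnnoyParam())

dim(sub$index)
names(idx.only)
dim(par.out$index)
```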
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
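A minimal sketch of a Manhattan search with Annoy follows, with a manual check that the reported distance is the sum of absolute coordinate differences. The simulated data and seed are illustrative assumptions, and the comparison is only expected to agree approximately, as the index may store values at lower precision.

```r
library(BiocNeighbors)

set.seed(123)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(1000*20), ncol=20)

# Request Manhattan distances via the parameter object.
mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the Manhattan distance between point 1 and its reported nearest neighbor.
nn <- mout$index[1,1]
manual <- sum(abs(data[1,] - data[nn,]))
c(reported=mout$distance[1,1], manual=manual)
```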
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change TEMPDIR to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpesdGhm/fileb34145b0b1486.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
The saved file can then be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
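The clean-up step can be sketched as follows for an Annoy index, assuming that deleting the file reported by AnnoyIndex_path() is sufficient once no further searches are needed. The simulated data and variable names are illustrative.

```r
library(BiocNeighbors)

set.seed(99)  # arbitrary seed for reproducible simulated data
data <- matrix(runif(1000*20), ncol=20)

# Build the index and record where its file lives on disk.
pre <- buildIndex(data, BNPARAM=AnnoyParam())
path <- AnnoyIndex_path(pre)
existed <- file.exists(path)

# Use the index for as many searches as required.
out <- findKNN(BNINDEX=pre, k=5)

# Delete the index file once all calculations are complete.
unlink(path)
removed <- !file.exists(path)
```

After the file is removed, the index object in 'pre' should not be used for further searches.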
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.0 knitr_1.44 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.1 xfun_0.40
## [4] jsonlite_1.8.7 S4Vectors_0.40.0 htmltools_0.5.6.1
## [7] stats4_4.3.1 sass_0.4.7 rmarkdown_2.25
## [10] grid_4.3.1 evaluate_0.22 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.36
## [16] BiocManager_1.30.22 compiler_4.3.1 codetools_0.2-19
## [19] Rcpp_1.0.11 BiocParallel_1.36.0 lattice_0.22-5
## [22] digest_0.6.33 R6_2.5.1 parallel_4.3.1
## [25] bslib_0.5.1 Matrix_1.6-1.1 tools_4.3.1
## [28] BiocGenerics_0.48.0 cachem_1.0.8