4

Problem:

I have N (~100m) strings each D (e.g. 100) characters long and with a low alphabet (eg 4 possible characters). I would like to find the k-nearest neighbors for every one of those N points ( k ~ 0.1D). Adjacent strings define via hamming distance. Solution doesn't have to be the best possible but closer the better.

Thoughts about the problem

I have a bad feeling this is a non trivial problem. I have read many papers and algorithms, however most of them has poor result in high dimension and it works when dimension is less than 5. For example this paper suggests an efficient algorithm, but its constant is related to dimension exponentially.

Currently, I am investigating on how can I reduce dimension in a way that hamming distance is preserved or can be computed.

Another option is locality sensitive hashing, Points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any Help? which option do you prefer?

Ashki
  • 63
  • 4
  • "which is the k-nearest neighbor": do you mean "which are the k-nearest neighbors" ? And what is 100m ?? –  Jan 31 '15 at 14:01
  • Yes, I have corrected. 100m indicates number of strings, so we have 100 million strings. – Ashki Feb 07 '15 at 11:06

1 Answers1

3

One of the previously asked questions has some good discussions, so you can refer to that,

Nearest neighbors in high-dimensional data?

Other than this, you can also look at,

http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf

Few papers which analyze different approaches,

http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf

https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf

Community
  • 1
  • 1
sarveshseri
  • 13,738
  • 28
  • 47
  • I remember having seen other very similar questions about string similarity with a small alphabet. Try searching some more here on SO. – Has QUIT--Anony-Mousse Jan 28 '15 at 12:08
  • All material you mentioned is related to continuous space, and I think changing discrete space to continuous don't work since data isn't spread in all space. In addition, measurement in my problem is hamming distance, however those materials use Euclidean distance. – Ashki Jan 30 '15 at 20:11