0

I have a database with n strings (n > 1 million), each string has 100 chars, each char is either a, b, c or d.

I would like to find the closest strings for each one, closest defines as having the smallest hamming distance. I would like to find the k-nearest strings for each one (k < 5).

Example

N = 5
i1 = aacbdbbb
i2 = abcbdbbb
i3 = bbcadabd
i4 = bbcadabb

HammingDistance(i1,i2) = 1
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 4
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 3
HammingDistance(i3,i4) = 1

For k=1 it should return {(i1,i2),(i2,i1),(i3,i4),(i4,i3)}. For k=2 it should return { (i1,{i2,i4}),(i2,{i1,i4}),(i3,{i4,i2}),(i4,{i3,i2})}. And so on. For each string should find k-nearest string.

Naive solution has O(n^2) time complexity. I would like to find solution with better complexity. I found some other solutions, but none of them was better than naive.

How can I optimally distribute such strings into clusters? One string could be in both or more clusters. Solution may be deterministic or probabilistic.

Community
  • 1
  • 1
Arman
  • 437
  • 3
  • 19
  • 1
    @AdamStelmaszczyk It's very close but not exact. I'd probably join you anyway if the answers to that one were of better quality. – David Eisenstat Jan 02 '15 at 15:38
  • 1
    True, but that points to http://stackoverflow.com/questions/7086100/fast-computation-of-pairs-with-least-hamming-distance which answers it about as well as it reasonably can be answered. – btilly Jan 02 '15 at 16:16
  • @AdamStelmaszczyk - k in this question is less than 5, however k in that one is general. In addition, sequences are not bitstring. The other question you mentioned is not improved complexity. – Arman Jan 02 '15 at 17:26
  • @AdamStelmaszczyk Solution is not difficult if sequences are bitstring. – Arman Jan 02 '15 at 17:30

0 Answers0