Optimal way to cluster set of strings with hamming distance

Question

I have a database with n strings (n > 1 million), each string has 100 chars, each char is either a, b, c or d.

I would like to find the closest strings for each one, closest defines as having the smallest hamming distance. I would like to find the k-nearest strings for each one (k < 5).

Example

N = 5
i1 = aacbdbbb
i2 = abcbdbbb
i3 = bbcadabd
i4 = bbcadabb

HammingDistance(i1,i2) = 1
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 4
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 3
HammingDistance(i3,i4) = 1

For k=1 it should return {(i1,i2),(i2,i1),(i3,i4),(i4,i3)}. For k=2 it should return { (i1,{i2,i4}),(i2,{i1,i4}),(i3,{i4,i2}),(i4,{i3,i2})}. And so on. For each string should find k-nearest string.

Naive solution has O(n^2) time complexity. I would like to find solution with better complexity. I found some other solutions, but none of them was better than naive.

How can I optimally distribute such strings into clusters? One string could be in both or more clusters. Solution may be deterministic or probabilistic.

@AdamStelmaszczyk It's very close but not exact. I'd probably join you anyway if the answers to that one were of better quality. — David Eisenstat, Jan 02 '15 at 15:38
True, but that points to http://stackoverflow.com/questions/7086100/fast-computation-of-pairs-with-least-hamming-distance which answers it about as well as it reasonably can be answered. — btilly, Jan 02 '15 at 16:16
@AdamStelmaszczyk - k in this question is less than 5, however k in that one is general. In addition, sequences are not bitstring. The other question you mentioned is not improved complexity. — Arman, Jan 02 '15 at 17:26
@AdamStelmaszczyk Solution is not difficult if sequences are bitstring. — Arman, Jan 02 '15 at 17:30

Optimal way to cluster set of strings with hamming distance

Example

0 Answers0

Linked