7

Given an array of bitstrings (all of the same length) and a query string Q, find the top-k strings most similar to Q, where the similarity between strings A and B is defined as the number of 1 bits in A AND B (the AND operation is applied bitwise).

I think there should be a classical result for this problem.

k is small (in the hundreds), while the number of vectors is in the hundreds of millions, and the length of each vector is 512 or 1024.
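For concreteness, here is the brute-force baseline I'm trying to beat (a minimal Python sketch; the function names are my own):

```python
import heapq

def popcount_similarity(a: int, b: int) -> int:
    # Number of positions where both bits are 1, i.e. popcount(a AND b).
    return bin(a & b).count("1")

def top_k(bitstrings, q: int, k: int):
    # Min-heap of the k best (similarity, index) pairs seen so far;
    # a new candidate replaces the heap minimum only if it is better.
    heap = []
    for i, b in enumerate(bitstrings):
        s = popcount_similarity(b, q)
        if len(heap) < k:
            heapq.heappush(heap, (s, i))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, i))
    return sorted(heap, reverse=True)  # best first

# 4-bit toy data; the real vectors would be 512- or 1024-bit integers
data = [0b1011, 0b0110, 0b1111, 0b0001]
print(top_k(data, 0b1010, k=2))  # → [(2, 2), (2, 0)]
```

This is O(n) per query, which is too slow for hundreds of millions of vectors and an ongoing stream of queries.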

Moonwalker
  • 2,180
  • 1
  • 29
  • 48
  • 1
    Looks to me you are looking for a *trie*... – Willem Van Onsem Jun 09 '16 at 19:39
  • 3
    Can you add a sample data and result? – displayName Jun 30 '16 at 19:19
  • 1
    What are the precise characteristic of your problem? Is the length of the bitstrings large? Is k large? Or the number of strings? On first sight you are essentially looking for the largest sums only summing the entries where Q is set to 1. You could data parallelize it on several nodes, locally computing the top-k and then aggregating the results. You should keep your top-k in a min-heap on each node and compare the current data entry with the min there (and if it is larger replace). Then you efficiently merge all the heaps and turn them into max, then pop the k entries. – uberwach Jun 30 '16 at 20:57
  • @uberwach k is small, in hundreds, while number of vectors in hundreds of millions and length of the vectors is 512 or 1024. – Moonwalker Jul 01 '16 at 01:54
  • 1
    Do you have only one Q or are you expecting several query strings Q to run? – uberwach Jul 01 '16 at 02:20
  • @uberwach I'm expecting ongoing stream of different query strings. This stuff is like search engine: got query, return most relevant results, got query return results, and so on. There may be several query requests simultaneously. – Moonwalker Jul 01 '16 at 02:40
  • Are the bitstrings randomly distributed? Or is there something special about them, for example the number of ones is limited? Or probability of a bit being set is perhaps non-uniform across bits? – maniek Jul 01 '16 at 22:26
  • @maniek those bitstrings are bloom-filters, if this of any help... – Moonwalker Jul 02 '16 at 00:52
  • Please put the information about the problem size at hand in the question. – greybeard Jul 06 '16 at 05:20

2 Answers

3

One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.

Note that efficient K-NNG construction is still an open problem, and "none of the known solutions for this problem is general, efficient and scalable" [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].

Your similarity function is often called Russell-Rao similarity (see for example A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that the corresponding Russell-Rao dissimilarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): the "if" part of "d(x, y) = 0 iff x == y" is false.
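To see why it is not a metric, consider the Russell-Rao dissimilarity d(a, b) = (n - |a AND b|) / n (a small sketch; the helper name is my own):

```python
def russell_rao_dissimilarity(a: int, b: int, n: int) -> float:
    # (n - popcount(a AND b)) / n for n-bit vectors a and b
    return (n - bin(a & b).count("1")) / n

# d(x, x) is 0 only when x is all ones, so x == y does NOT imply d(x, y) == 0:
x = 0b1010  # 4-bit vector with two ones
print(russell_rao_dissimilarity(x, x, 4))  # → 0.5, even though x == x
```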

In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm to find k-NNs using a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β). When β=0, this function reduces to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the only paper I could find that examines this problem.

You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.

Lior Kogan
  • 19,919
  • 6
  • 53
  • 85
  • 1
    `efficient K-NNG construction is still an open problem` `number of vectors in hundreds of millions` - ouch. – greybeard Jul 06 '16 at 05:18
1

This problem can be solved by writing a simple Map and Reduce job. I'm not claiming that this is the best solution, nor that it is the only solution.

Also, you have disclosed in the comments that k is in the hundreds, that there are hundreds of millions of bitstrings, and that each of them is 512 or 1024 bits long.


Mapper pseudo-code:

  • Given Q;
  • For every bitstring b, compute similarity = popcount(b & Q), i.e. the number of 1 bits in b & Q;
  • Emit (similarity, b)

Now, the combiner can consolidate, from every mapper, the lists of all bitstrings that have the same similarity.

Reducer pseudo-code:

  • Consume (similarity, listOfBitStringsWithThisSimilarity);
Output them in decreasing order of similarity value.

From the output of reducer you can extract the top-k bitstrings.
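The map and reduce phases above can be sketched locally in Python like this (a toy single-process simulation of the MapReduce job, not a distributed implementation; all names are my own):

```python
from collections import defaultdict

def map_phase(bitstrings, q):
    # Mapper: emit (similarity, bitstring) pairs,
    # where similarity = popcount(b AND Q).
    for b in bitstrings:
        yield bin(b & q).count("1"), b

def reduce_phase(pairs, k):
    # Reducer: group bitstrings by similarity, then walk the groups
    # in decreasing similarity order until k results are collected.
    groups = defaultdict(list)
    for sim, b in pairs:
        groups[sim].append(b)
    out = []
    for sim in sorted(groups, reverse=True):
        for b in groups[sim]:
            out.append((sim, b))
            if len(out) == k:
                return out
    return out

# 4-bit toy data standing in for 512/1024-bit vectors
data = [0b1011, 0b0110, 0b1111, 0b0001]
print(reduce_phase(map_phase(data, 0b1010), k=2))  # → [(2, 11), (2, 15)]
```

In a real deployment the mapper and reducer would run on separate nodes, with the framework's shuffle phase doing the grouping by similarity key.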

So, the MapReduce paradigm is probably the classical solution that you are looking for.

displayName
  • 13,888
  • 8
  • 60
  • 75
  • 1
    @Moonwalker: I just read in a comment on your question that you want to develop a search-engine-like system. In that case, please go ahead and use what I'm suggesting. In case you don't know, the MapReduce paradigm was conceived at Google precisely to help them better process web data. Your system is along the same lines and will benefit from this. Also, MapReduce can be implemented on almost any platform. – displayName Jul 01 '16 at 21:30