11

I have an array of 256-bit values. The array is huge (millions of records) but it is modified just rarely and it fits to the memory. For a given 256-bit number I want to find whether there exists a record which has at least N bits equal. For instance, 10000 and 01111 have 0 bits equal, 1000 and 1001 have 3 bits equal. Always N > 128, or rather N > 140. I don't need to find a particular number, I just need to find whether such number exists in the list or not.

Is there a type of data structure or some kind of index which could somehow speed up the searching?

Qantas 94 Heavy
  • 15,750
  • 31
  • 68
  • 83
smrt28
  • 469
  • 3
  • 20
  • Interesting problem. A [nearest neighbor search](http://en.wikipedia.org/wiki/Nearest_neighbor_search) algorithm would do it, though that would do more work than necessary. If the bits are anything close to random, the answer will almost always be "yes". I feel like there should be a way to do this very, very quickly. – user2357112 Jan 24 '14 at 19:39
  • I had the same feeling :-( – smrt28 Jan 24 '14 at 20:14
  • 3
    This question was about 32 bits, but the answer probably still applies at least partly: http://stackoverflow.com/q/6389841/555045 – harold Jan 24 '14 at 20:23
  • What will the data look like? The random case is easy, but it's quite unlikely you'll have random data. Should we expect a lot of "yes" answers or "no" answers? Should we expect a lot of queries right on the border? – user2357112 Jan 24 '14 at 20:31
  • All the inputs are absolutely random. This wont help you. In fact, the 256bit number are hashes. But I can accept some false negative errors. So, it wont be a problem if the algorithm tells in 0.1% cases a wrong answer - "there is no such record" even if there is. I also feel, the fact that N must be > 140 could be a significant help. – smrt28 Jan 24 '14 at 21:21
  • 1
    ...well, I think it would be fair if I reveal the problem behind my question. I'm working on a spam filtering an I discovered NILSIMCA hash algorithm. http://en.wikipedia.org/wiki/Nilsimsa_Hash I suppose I'll have many NILSIMCA hashes which will be the spam samples. When an email arives I need to wall through all of them, search whether the email is spam or not. – smrt28 Jan 24 '14 at 21:23
  • @smrt28: Do you need an online algorithm, or are you given all the queries upfront? – Niklas B. Jan 24 '14 at 21:32
  • I don't think you can do a lot better than linear search. Linear search can be optimized very well, however. I assume you can get under 0.5 seconds/query on a CPU and obviously linear speedup if you have multiple cores at hand. It would also be straightforward to delegate the work to a GPU and get huge speedups. What kind of throughput do you expect? – Niklas B. Jan 24 '14 at 21:51
  • I suppose I would build a database of the hashes once per day based on collected spam reports and then I would have to handle many emails per sec. So, it would be an online algorithm. – smrt28 Jan 24 '14 at 21:51
  • @smrt28: See my other comment. How many mails? And how many records? And how much processing power? :) – Niklas B. Jan 24 '14 at 21:51
  • I'm just trying to save as much money as possible to my company, so I'm searching for the most efficient solution :-) – smrt28 Jan 24 '14 at 21:55
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/46046/discussion-between-smrt28-and-niklas-b) – smrt28 Jan 24 '14 at 21:55

2 Answers2

1

The algorithm to solve that is O(n). Your only option is looping through the array until finding a number that matches with your target.

Now, how do we find if two numbers match?. The most efficient way is using bitwise operations. For instance I write a code that works for 64 bits long numbers:

int compare(long l1, long l2)
{
    //Use xor to get what bits are the same
    long xor = ~ (l1 ^ l2);

    //count the number of bits with value = 1
    int count = 0;
    while(xor != 0)
    {
        //check if the right bit is 1, using & operator beetween our number and 0000001
        if(xor & 1 > 0) count++;

        //shift right the bits
        xor = xor >> 1
    }

    return count;
}

This comparation can be adapted, depending on how your 256 bits numbers are implemented. An optimization could be to break the while loop when you reach count >= N.

You can check this question to look for more efficient ways to count the bits with value 1.

Hope it helps!

Community
  • 1
  • 1
conca
  • 1,394
  • 9
  • 10
  • 1
    "Your only option is looping through the array until finding a number that matches with your target." Can you prove that? Why is there not a better way? – Niklas B. Jan 25 '14 at 02:00
  • if the array is not sorted and you doesnt have any additional information about the items, you have to loop throught the array until finding the match. It is the same as finding the max integer of an array, in the worst case it takes you `n` moves to find the max item, where `n` is the length of the array. – conca Jan 27 '14 at 15:21
  • But you *can* sort the array and speed up the search and you *can* preprocess it to store additional info. OP is looking for a way to speed up the process. – Niklas B. Jan 27 '14 at 15:47
  • You are right! if you are going to perform more than 1 query, you can do some preprossesing that will take you at least `O(n)`, but reducing the complexity of each query. I will take a look again to this problem, it's very interesting, thank you! – conca Jan 27 '14 at 16:07
1

I may be wrong, but it looks like you can index your data as bitwise trie, which for your case is identical to binary tree in structure, but is interpreted differently. Here's illustration (source):

enter image description here

For the sake of simplicity, let's think about your 256-bit numbers as about vectors with 256 numbers (also binary, of course). Having a tree like above, you can find whether particular vector exists in your dataset in 256 steps or less by just walking down the tree and testing if subsequent branch (either "0" or "1") exists. Something like this (code's quite raw, but you should get an idea):

def exists(node, v, i):
    """Checks if subvector of v starting from index i exists in 
       sub-tree defined by node
    """
    if i == len(v): 
        return True
    elif v[i] == 0 and node.left: 
        return exists(node.left, v, i + 1)
    elif v[i] == 1 and node.right: 
        return exists(node.right, v, i + 1)
    else: 
        return False

At each step we decide where to go left or right branch based on current value element of v. But you also need to process up to K different elements. Ok, so why not to use error count and process both branches?

def exists(node, v, i, err_left=K):
    """Checks if subvector of v starting from index i exists in 
       sub-tree defined by node with up to err_left errors.
    """
    if i == len(v): 
        return True
    elif v[i] == 0: 
        if node.left:
            return exists(node.left, v, i + 1, err_count)
        elif node.right: 
            return exists(node.left, v, i + 1, err_left - 1)  # proceed, but decrease 
                                                              #   errors left
        else: 
            return False            
    elif v[i] == 1: 
        if node.right:
            return exists(node.right, v, i + 1, err_count)
        elif node.left: 
            return exists(node.right, v, i + 1, err_left - 1)  # proceed, but decrease 
                                                               #   errors left
        else: 
            return False 
    else: 
        return False

Running time of this algorithm heavily depends on your settings. In worst case (K = 256) it is still O(n) (you need to check each branch), but this time falls quickly as K decreases (with small K it is almost O(log n)). With K ~ 128 it's somewhere in the middle.

You may also want to rewrite function so that it checked good-day (no errors) scenarios first (e.g. if you expect number of errors be small on average).

Trie compressed data in a certain sense, but if you have issues with memory, try to use table representation instead of objects and pointers.

Finally, you can combine it with bloom filters to filter numbers that definitely are not in a set first, and then check the rest with above tree.

ffriend
  • 27,562
  • 13
  • 91
  • 132