Compare two numbers for "likeness"

Question

This is part of a search function on a website. So im trying to find a way to get to the end result as fast as possible.

Have a binary number where digit order matters.

Input Number = 01001

Have a database of other binary numbers all the same length.

01000, 10110, 00000, 11111

I dont know how to write what im doing, so im going to do it more visually below.

// Zeros mean nothing & the location of a 1 matters, not the total number of 1's.    
input num > 0 1 0 0 1 = 2 possible matches
number[1] > 0 1 0 0 0 = 1 match = 50% match
number[2] > 1 0 1 1 0 = 0 match = 0% match
number[3] > 0 0 0 0 0 = 0 match = 0% match
number[4] > 1 1 1 1 1 = 2 match = 100% match

Now obviously, you could go digit by digit, number by number and compare it that way (using a loop and what not). But I was hoping there might be an algorithm or something that will help. Mostly because in the above example I only used 5 digit numbers. But im going to be routinely comparing around 100,000 numbers with 200 digits each, that's a lot of calculating.

I usually deal with php and MySQL. But if something spectacular comes up I could always learn.

Doesn't the order of the digits matter for any binary number? Aside from that, could you please explain what you're trying to accomplish? I'm not really understanding the purpose of this whole scheme. — Blender, Sep 05 '11 at 22:56
Just an idea from the top of my head, why don't you start with comparing the relationship between numbers as in X > Z, X < Z, etc... — Bassem, Sep 05 '11 at 22:56
What exactly do you need to do? Do you just need to test for the existence of a 100% match, or do you need every possible % match across the entire database? Testing for a match can be O(k), for k = number of digits, although testing for every % match is going to be O(n*k), for n = number of database entries. I'm not seeing a way to reduce n or k, since you need to examine all digits and all entries, so the problem domain has to reduce the criteria somehow. — Stefan Kendall, Sep 05 '11 at 23:13
Yes, i suppose order of digits matter for binary numbers. :) Anyways, this is part of a search funtion where 1 equals a positive attribute desired by a user in an item. So in a html form for example... option 1 equals the first digit, option 2 equals the second digit, etc etc... So say a user searches for an item that has the traits represented by 100110. Which items in the database has the most closely matching traits. This is just what I thought of doing to acomplish this. I want to be able to return a list of items in order of percent closeness to the input value. — Adam, Sep 06 '11 at 19:11

score 4 · Accepted Answer · edited May 23 '17 at 12:04

4

If it's possible to somehow chop up your bitstrings in integer-size chunks some elementary boolean arithmetic would do, and that kind of instructions is generally pretty fast

$matchmask = ~ ($inputval ^ $tomatch) & $inputval

What this does:

the xor determines the bits that are different in the inputval and tomatch
negation gives a value where all bits that are equal in inputval and tomatch are set
and that with inputval and only the bits that are 1 in both inputval and tomatch remain set.

Then count the number of bits set in the result, look at How to count the number of set bits in a 32-bit integer? for an optimal solution, easily translated into php

edited May 23 '17 at 12:04

Community

1
1

answered Sep 06 '11 at 00:14

fvu

32,488
6
61
79

Man it took me a while to learn about what the heck you wrote. :) Anyways, your answer would be perfect if I was dealing with smaller numbers. I _could_ split the numbers into smaller segments as you suggest. But I think im going another direction. – Adam Sep 07 '11 at 03:04
I'll choose your answer if nothing else shows up in the next day. – Adam Sep 07 '11 at 03:05
ok thanks. An educated guess: thanks to the speed and simplicity of the technique explained above, in terms of processing time it's often less costly to map bits to words or dwords, apply the required logic word-wise or dword-wise and "remap" the resulting words back to individual bits, because it will reduce you loop count by a factor of 16 or 32. In your case the last step is not even necessary as you just want the sum of sideways counts. – fvu Sep 07 '11 at 09:19

Manlio · Answer 2 · 2011-09-05T23:29:15.373

1

Well, the first thing I can think of is a simple bitwise AND between the two numbers; you can then analyze the result to get the match percentage:

if( result >= input ) 
    //100% match
else {
    result ^= input;

    /* The number of 1's in result is the number of 1 of "input" 
     * that are missing in "result".
     */
}

Of course, you'll need to implement your own AND and XOR function (this will work only for 32 bit integers). Note that it works only with unsigned numbers.

edited Sep 05 '11 at 23:29

answered Sep 05 '11 at 23:07

Manlio

10,768
9
50
79

>= isn't the 100% match case. The 1s cover, so that's why it's a 100% match. Consider 10000 and 01111. That's a 0% match, given the OPs description – Stefan Kendall Sep 05 '11 at 23:16
Sorry, don't understand your objection. The only problem I can think of is that the first 1 may be a sign digit (added a note to underline this). According to the OPs description 10000 and 01111 is 0% match. According to my code: `result = input AND firstNumber = 00000 < 10000` (think of unsigned number), so it's the else statement. `result ^= input` => `result = 10000`. The number of 1's in result is the number of 1 of `input` that are missing in result. – Manlio Sep 05 '11 at 23:28
Ah, this wasn't clear from the text. I went straight to block and read result as current-item-to-test. Corrected vote. – Stefan Kendall Sep 06 '11 at 00:09

score 1 · Answer 3 · answered Sep 05 '11 at 23:26

Instead of checking each bit, you could pre-process the input and determine which bits need checking. In the worst case, this devolves into processing each bit, but for a normal distribution, you'll save some processing.

That is, for input

01001, iterate over the database and determine if number1[0] & input is non-zero, and (number1[3] >> 8) & input is non-zero, assuming 0 as the index of the LSB. How you get fast bit-shifting and anding with the large numbers is on you, however. If you detect 1s than 0s in the input, you could always invert the input and test for zero to detect coverage.

This will give you modest improvement, but it's at best a constant-time reduction of the problem. If most of your inputs are balanced between 0s and 1s, you'll halve the number of required operations. If it's more biased, you'll get better results.

I don't have time to read all the answers at the moment. But to first "determine which bits need checking". I can't believe i didn't think of doing a pre check of the input number to narrow down the number of bits that need checking in each number. Because I only care about 1's and which location they're in. So instead of checking each digit of a 200 digit number looking for 1's. I would only have to check about 30 locations in that number. That saves a lot. I would create an array with the locations/digits to check for a "1" match. — Adam, Sep 06 '11 at 19:29

score 0 · Answer 4 · answered Sep 05 '11 at 23:05

0

Suppose the input number is called A (so in your example A = 01001) and the other number is x. You'll have 100% match when x & A == A. Otherwise, for partial matches, the number of 1 bits will be (taken from hacker's delight):

x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >>16) & 0x0000FFFF);

Note this will work for 32 bits integers.

answered Sep 05 '11 at 23:05

BlackBear

22,411
10
48
86

How does this help for 200 digit binary numbers? – Stefan Kendall Sep 05 '11 at 23:08
And what datatype, pray tell, will support 200 bits and a hardware &? – Stefan Kendall Sep 05 '11 at 23:14

Janick Bernet · Answer 5 · 2011-09-05T23:14:54.637

0

Let's assume you have a function bit1count, then from what you describe, the "likeness" formula should be:

100.0 / min(bit1count(n1), bit1count(n2)) * bit1count(n1 & n2)

With n1 and n2 being the two numbers and & being the logical and operator.

bit1count can be easily implemented using a loop, or, more elegant, using the algorithm provided in BigBears answer.

There is actually a BIT_COUNT in mysql, so something like this should work:

SELECT 100.0 / IF(BIT_COUNT(n1) < BIT_COUNT(n2), BIT_COUNT(n1), BIT_COUNT(n2)) * BIT_COUNT(n1 & n2) FROM table

edited Sep 05 '11 at 23:14

answered Sep 05 '11 at 23:06

Janick Bernet

20,544
2
29
55

Whover downvoted me: If it was because of the max, I replaced it by a min, given the last of the examples being 100%. – Janick Bernet Sep 05 '11 at 23:15

Compare two numbers for "likeness"

5 Answers5