1

I have a database with binary these strings

record no 1: 1111111111111011000100110001100100010000000000000011000000000000
record no 2: 1111111111111111111111100001100000010000000000000011000000000000
record no 3: 1110000011110000111010001110111011110000111100001100000011000000
...

So, i want to find out what record had similar bỉnary string with this: 1111111111111011000100110001100100010000000000000011000000001100

You can see, the record number 1 is 98% relevance. record number 2 is 70% relevance, and record number 3 is only 45% percent relevance.

This is huge database (200.000 records)...

Kara
  • 6,115
  • 16
  • 50
  • 57
TomSawyer
  • 3,711
  • 6
  • 44
  • 79
  • Take a look at this SO question: http://stackoverflow.com/questions/4777070/hamming-distance-on-binary-strings-in-sql – Bjoern Oct 15 '13 at 04:13
  • @Bjoern can you help me to complete mysql query? i already read it, but i still don't know how to make a query – TomSawyer Oct 15 '13 at 04:15
  • Well, your select query will look something like `SELECT HUMMINGDISTANCE(some_parameter) FROM yourtable;`, if you adapt the function provided there. The author converts binary strings to big integers for performance, so you should adapt this while feeding the function with your parameters. He also uses 32 bytes, you have take that into consideration with your binary values. – Bjoern Oct 15 '13 at 04:23

1 Answers1

1
SELECT * FROM MY_TABLE ORDER BY BIT_COUNT(CAST(CONV(record,2,10) as unsigned integer) ^ CAST(b'11...0' as unsigned integer)) LIMIT 1;

The above query will return the most similar record.

You can also SELECT the BIT_COUNT, it's min=0 means identity (record=input) or 100%, it's max=64 means that all bits differ (record = ~input) or 0%.

P. B. M.
  • 262
  • 2
  • 6
  • just tried your query, but i don't see the order by relevance here. If field type is BIGINT, i can't store full string to database. i'm using varchar. Here is my query: SELECT * FROM files ORDER BY BIT_COUNT( hash ^1110000011110000111010001110111011110000111100001100000011000000 ) LIMIT 0 , 30 – TomSawyer Oct 15 '13 at 04:56
  • Well, it works only on integers, you will have to convert records to BIGINT and your input value to a long. – P. B. M. Oct 15 '13 at 05:17
  • You can try to replace record with BINARY(record) and select your input as b'1100...'. But I'm not sure whether BINARY conversion will work. – P. B. M. Oct 15 '13 at 05:28
  • okay, try this: ORDER BY BIT_COUNT(CAST(CONV(record,2,10) as unsigned integer) ^ CAST(b'11...0' as unsigned integer)); where "record" is your varchar. – P. B. M. Oct 15 '13 at 06:23
  • many thanks. it works like a charm now. so, if i have large database (200k records) this query with varchar field type will affect to my performance? or is the faster way to query it? – TomSawyer Oct 16 '13 at 13:09
  • Hi @TomSawyer thanks for checking it out. You can speed it up by introducing a new field to your table of type unsigned bigint and setting it beforehand to the left-hand side of the bitwise XOR operation (the one with CAST, CONV of your varchar). Also on the right-hand side, depending on the way (prog. language etc.) you use to access your DB, and the way you obtain the search value, you can replace it with an unsigned long (8 bytes). Then you end up with BIT_COUNT(mybigintfld ^ input). BTW, what's your access time now? – P. B. M. Oct 17 '13 at 02:17