[
  {"id"=>11486, "opt_ids"=> [3545, 3546, 3548, 3550] },
  {"id"=>12624, "opt_ids"=> [3545, 3396, 3548, 3550] }, 
  {"id"=>14588, "opt_ids"=> [3394, 3396, 3397, 3399] }, 
  {"id"=>14589, "opt_ids"=> [3394, 3545, 3398, 3548] }, 
  {"id"=>14590, "opt_ids"=> [3394, 3396, 3397, 3399, 3545, 3547, 3548, 3551, 3653, 3655, 3657, 3660, 3772, 3775, 3777, 3778]},
  ...

]

Is there an efficient way of finding the 2 ids which share the maximum number of opt_ids?

The answer to the above example would be

[[11486, 12624], [14588, 14590]]

What I have already tried:

  1. Take each hash and compare its opt_ids to the opt_ids of every other remaining hash in the array.
  2. Pair the current hash's id with the id of the hash whose opt_ids match it the most.
  3. So I am literally looping over each hash as many times as there are hashes in the array - O(n^2).
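The brute-force approach described above can be sketched as follows (using only the first four hashes from the example; the pair scoring uses array intersection, which I assume is what "matching opt_ids" means here):

```ruby
# Brute-force sketch: score every unordered pair of hashes by how many
# opt_ids they share, then keep the pair with the largest overlap.
data = [
  { "id" => 11486, "opt_ids" => [3545, 3546, 3548, 3550] },
  { "id" => 12624, "opt_ids" => [3545, 3396, 3548, 3550] },
  { "id" => 14588, "opt_ids" => [3394, 3396, 3397, 3399] },
  { "id" => 14589, "opt_ids" => [3394, 3545, 3398, 3548] }
]

# combination(2) enumerates each unordered pair exactly once - this is
# the O(n^2) loop described in the question.
scored = data.combination(2).map do |a, b|
  [[a["id"], b["id"]], (a["opt_ids"] & b["opt_ids"]).size]
end

best = scored.max_by { |_, overlap| overlap }
p best  #=> [[11486, 12624], 3]
```

`combination(2)` avoids comparing a pair twice, but the number of pairs is still n(n-1)/2, so the overall work remains quadratic in the number of hashes.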
Artjom B.
Gautam
  • What would the answer be for the example above? And what have you tried? "Looping" is vague. – Schwern Feb 08 '21 at 05:10
  • @Schwern Thank you for asking, I have updated my question. Please let me know if it's clearer now? – Gautam Feb 08 '21 at 05:55
  • Yes, thank you. How much data do you have, and how often does it change? What kind of performance are you looking for? It might make sense to put it into SQLite. – Schwern Feb 08 '21 at 06:07
  • @Schwern This data is already a part of the database and I have aggregated it from there. The number of hashes in an array can vary between 50 to 500 for now. The performance I am looking for is that to match 2 id's based on maximum similar option_ids I don't have to loop through each hash. The number of option_ids will increase each week. – Gautam Feb 08 '21 at 06:11
  • It would probably be easier and faster to do it in SQL. Could you show the original tables? – Schwern Feb 08 '21 at 06:13
  • @Schwern http://sqlfiddle.com/#!9/ece591/3 - I have put the table here with the same data as above. I have created a view from many different tables. – Gautam Feb 08 '21 at 06:31
  • O(n²) doesn't look too bad, actually. What you could try: replace the arrays with sets, in order to avoid a conversion every time you calculate `opt_ids1 & opt_ids2`. You could also reverse the data structure: for each `opt_id`, create the set of `ids` containing that opt_id. I'm not sure what the next step could be, though. – Eric Duminil Feb 08 '21 at 09:26
  • @Gautam What do you want to do with duplicates? You have 14590, 3548 in there twice. – Schwern Feb 08 '21 at 19:21
  • I suggest checking out https://stackoverflow.com/questions/9606492/hamming-distance-similarity-searches-in-a-database/47487949#47487949 – Kevin Jul 13 '23 at 01:35

3 Answers


From the question's comments: "I have put the table here (sqlfiddle) with the same data as above. I have created a view from many different tables."

Do it in SQL, this is what it's good at.

Use a self-join to get the number of overlaps for each pair.

select
  a.emp_id emp_id1,
  b.emp_id emp_id2,
  count(a.option_id) as overlap
from data a
join data b on
  -- Ensure we count each pair only once
  a.emp_id < b.emp_id and
  a.option_id = b.option_id
group by a.emp_id, b.emp_id

Then use that as a CTE to select the pairs with the most overlaps.

with overlaps as (
  select
    a.emp_id emp_id1,
    b.emp_id emp_id2,
    count(a.option_id) as overlap
  from data a
  join data b on
    a.emp_id < b.emp_id and
    a.option_id = b.option_id
  group by a.emp_id, b.emp_id
)
select *
from overlaps
where overlap = (
  select max(overlap)
  from overlaps
)

So long as you're indexed, that should perform much better than pulling all the data out into Ruby.

create index idx_option_emp_ids on data(option_id, emp_id);

Even without indexes, it should perform much better than pulling it all out into Ruby.

Schwern

Suppose we are given a non-negative integer max_diff such that two integers n1 and n2 are regarded as "similar" if

|n1 - n2| <= max_diff

Further suppose that we are given the following array of hashes:

arr = [
  { "id"=>11486, "opt_ids"=> [3545, 3546, 3548, 3550] },
  { "id"=>12624, "opt_ids"=> [3545, 3396, 3548, 3550] }, 
  { "id"=>14588, "opt_ids"=> [3394, 3396, 3397, 3399] }, 
  { "id"=>14589, "opt_ids"=> [3394, 3545, 3398, 3548] }, 
  { "id"=>14590, "opt_ids"=> [3394, 3396, 3397, 3399] },
  { "id"=>14591, "opt_ids"=> [3545, 3547, 3548, 3551] },
  { "id"=>14592, "opt_ids"=> [3653, 3655, 3657, 3660] },
  { "id"=>14593, "opt_ids"=> [3772, 3775, 3777, 3778] }
]

and specify:

max_diff = 3

I will assume that we are to rank pairs of these hashes (and hence associated pairs of values of "id") by a count of the number of pairs of elements [h1["opt_ids"][i], h2["opt_ids"][j]] that are "similar" in terms of my definition above. This may not coincide with what you have in mind, but it may be a useful starting point. I briefly address the case where max_diff is always zero.


Consider the values of "opt_ids" for the hashes arr[0] and arr[3]:

a0 = arr[0]["opt_ids"]
  #=> [3545, 3546, 3548, 3550]
a3 = arr[3]["opt_ids"]
  #=> [3394, 3545, 3398, 3548]

As indicated in the table below, 3545 in a0 is similar to two elements of a3, 3545 and 3548. Similarly, 3546, 3548 and 3550 in a0 are similar to 2, 2 and 1 elements of a3, respectively, meaning that the hashes arr[0] and arr[3] can be thought of as having a similarity score of 7.

  a0      a3      #
____________________
3545: 3545, 3548 (2)
3546: 3545, 3548 (2)
3548: 3545, 3548 (2)
3550: 3548       (1)

We may implement this counting with the following helper method.

def count_similar(a1, a2, max_diff)
  a1.product(a2).count { |n1,n2| (n1-n2).abs <= max_diff }
end

For example,

count_similar(a0, a3, max_diff)
  #=> 7

If I've misunderstood the question and max_diff is always zero, and if all the arrays arr[i]["opt_ids"] contain unique elements (as they do in the example in the question), one could write

def count_similar(a1, a2)
  (a1 & a2).size
end

and drop the max_diff argument.


We may now compute similarity measures for pairs of ids as follows:

def similarities(arr, max_diff)
  arr.combination(2)
     .map do |h1,h2|
       [[h1["id"], h2["id"]],
        count_similar(h1["opt_ids"], h2["opt_ids"], max_diff)]
      end.to_h
end
similarities(arr, max_diff)
  #=> {[11486, 12624]=> 9, [11486, 14588]=>0, [11486, 14589]=> 7, [11486, 14590]=>0,
  #    [11486, 14591]=>13, [11486, 14592]=>0, [11486, 14593]=> 0, [12624, 14588]=>4,
  #    [12624, 14589]=> 7, [12624, 14590]=>4, [12624, 14591]=>10, [12624, 14592]=>0,
  #    [12624, 14593]=> 0, [14588, 14589]=>6, [14588, 14590]=>14, [14588, 14591]=>0,
  #    [14588, 14592]=> 0, [14588, 14593]=>0, [14589, 14590]=> 6, [14589, 14591]=>7,
  #    [14589, 14592]=> 0, [14589, 14593]=>0, [14590, 14591]=> 0, [14590, 14592]=>0,
  #    [14590, 14593]=> 0, [14591, 14592]=>0, [14591, 14593]=> 0, [14592, 14593]=>0}

We might then compute further measures, such as determining the five pairs having the highest similarity scores:

similarities(arr, max_diff).max_by(5, &:last)
  #=> [[[14588, 14590], 14], [[11486, 14591], 13], [[12624, 14591], 10],
  #    [[11486, 12624],  9], [[12624, 14589],  7]]

This shows that the pair of ids, 14588 and 14590 (from arr[2] and arr[4]) have the highest similarity score, of 14.
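For the exact-match case (max_diff always zero), the intersection-based count_similar can be dropped into the same pipeline. A self-contained sketch over the same arr, combining the two pieces above:

```ruby
arr = [
  { "id" => 11486, "opt_ids" => [3545, 3546, 3548, 3550] },
  { "id" => 12624, "opt_ids" => [3545, 3396, 3548, 3550] },
  { "id" => 14588, "opt_ids" => [3394, 3396, 3397, 3399] },
  { "id" => 14589, "opt_ids" => [3394, 3545, 3398, 3548] },
  { "id" => 14590, "opt_ids" => [3394, 3396, 3397, 3399] },
  { "id" => 14591, "opt_ids" => [3545, 3547, 3548, 3551] },
  { "id" => 14592, "opt_ids" => [3653, 3655, 3657, 3660] },
  { "id" => 14593, "opt_ids" => [3772, 3775, 3777, 3778] }
]

# Exact-match similarity: the number of opt_ids the two arrays share.
def count_similar(a1, a2)
  (a1 & a2).size
end

def similarities(arr)
  arr.combination(2).map do |h1, h2|
    [[h1["id"], h2["id"]], count_similar(h1["opt_ids"], h2["opt_ids"])]
  end.to_h
end

similarities(arr).max_by(&:last)
  #=> [[14588, 14590], 4]
```

Here 14588 and 14590 have identical opt_ids arrays, so they share all 4 elements and win outright.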

Cary Swoveland

You can check the difference between the arrays. If the result is an empty array, or one with only a few elements, the arrays are equal or nearly equal.

For example:

a = [1,2,3,4,5]
b = [1,3,2,5,4]
c = [2,4,3,5,9]
d = [2,7,8,8,9]

a - b #=> []            (similar)
a - c #=> [1]           (almost similar)
a - d #=> [1,3,4,5]     (least similar)
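Note that array difference is asymmetric, as Eric Duminil points out in the comments: `[] - [1, 2, 3] #=> []`, even though the arrays share nothing. A sketch using set intersection (`&`) as the similarity measure instead:

```ruby
a = [1, 2, 3, 4, 5]
b = [1, 3, 2, 5, 4]
c = [2, 4, 3, 5, 9]

# `a - b` only reports elements of a missing from b; elements unique to b
# are silently ignored. Intersection counts shared elements regardless of
# order or which side is longer, so it is a safer similarity measure.
(a - b).size  #=> 0  (every element of a appears in b)
(a & b).size  #=> 5  (five shared elements)
(a & c).size  #=> 4  (four shared elements)
```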
Dharman
user681
  • Correct, but I would still have to compare a with b,c and d one by one. Trying to avoid exactly that. – Gautam Feb 08 '21 at 06:32
  • The arrays have different lengths, though. So an empty difference doesn't mean anything: `a - b # => []` could simply mean that `a` is empty. Set intersection (`&`) might be a better idea. – Eric Duminil Feb 08 '21 at 09:12