Suppose we are given a non-negative integer max_diff
such that two integers, n1
and n2
are regarded as "similar" if
|n1 - n2| <= max_diff
Further suppose that we are given the following array of hashes:
arr = [
{ "id"=>11486, "opt_ids"=> [3545, 3546, 3548, 3550] },
{ "id"=>12624, "opt_ids"=> [3545, 3396, 3548, 3550] },
{ "id"=>14588, "opt_ids"=> [3394, 3396, 3397, 3399] },
{ "id"=>14589, "opt_ids"=> [3394, 3545, 3398, 3548] },
{ "id"=>14590, "opt_ids"=> [3394, 3396, 3397, 3399] },
{ "id"=>14591, "opt_ids"=> [3545, 3547, 3548, 3551] },
{ "id"=>14592, "opt_ids"=> [3653, 3655, 3657, 3660] },
{ "id"=>14593, "opt_ids"=> [3772, 3775, 3777, 3778] }
]
and specify:
max_diff = 3
I will assume that we are to rank pairs of these hashes (and hence associated pairs of values of "id"
) by a count of the number of pairs of elements [h1["opt_ids"][i], h2["opt_ids"][j]]
that are "similar" in terms of my definition above. This may not coincide with what you have in mind, but it may be a useful starting point. I briefly address the case where max_diff
is always zero.
Consider the values of "opt_ids"
for the hashes arr[0]
and arr[3]
:
a0 = arr[0]["opt_ids"]
#=> [3545, 3546, 3548, 3550]
a3 = arr[3]["opt_ids"]
#=> [3394, 3545, 3398, 3548]
As indicated in the table below, 3545
in a0
is similar to two elements of a3
, 3545
and 3548
. Similarly, 3546
, 3548
and 3550
in a0
are similar to 2
, 2
and 1
elements of a3
, repectively, meaning that the hashes arr[0]
and arr[3]
can be thought of as having a similarity score of 7
.
a0 a3 #
____________________
3545: 3545, 3548 (2)
3546: 3545, 3548 (2)
3548: 3545, 3548 (2)
3550: 3548 (1)
We may implement this counting with the following helper method.
def count_similar(a1, a2, max_diff)
a1.product(a2).count { |n1,n2| (n1-n2).abs <= max_diff }
end
For example,
count_similar(a0, a3, max_diff)
#=> 7
If I've misunderstood the question and max_diff
is always zero, and if all the arrays arr[i]["opt_ids"]
contain unique elements (as they do in the example in the question), one could write
def count_similar(a1, a2)
(a1 & a2).size
end
and drop the max_diff
argument.
We may now compute similarity measures for pairs of ids as follows:
def similarities(arr, max_diff)
arr.combination(2)
.map do |h1,h2|
[[h1["id"], h2["id"]],
count_similar(h1["opt_ids"], h2["opt_ids"], max_diff)]
end.to_h
end
similarities(arr, max_diff)
#=> {[11486, 12624]=> 9, [11486, 14588]=>0, [11486, 14589]=> 7, [11486, 14590]=>0,
# [11486, 14591]=>13, [11486, 14592]=>0, [11486, 14593]=> 0, [12624, 14588]=>4,
# [12624, 14589]=> 7, [12624, 14590]=>4, [12624, 14591]=>10, [12624, 14592]=>0,
# [12624, 14593]=> 0, [14588, 14589]=>6, [14588, 14590]=>14, [14588, 14591]=>0,
# [14588, 14592]=> 0, [14588, 14593]=>0, [14589, 14590]=> 6, [14589, 14591]=>7,
# [14589, 14592]=> 0, [14589, 14593]=>0, [14590, 14591]=> 0, [14590, 14592]=>0,
# [14590, 14593]=> 0, [14591, 14592]=>0, [14591, 14593]=> 0, [14592, 14593]=>0}
We might then compute measures such the determining the five pairs having the highest similarities scores:
similarities(arr, max_diff).max_by(5, &:last)
#=> [[[14588, 14590], 14], [[11486, 14591], 13], [[12624, 14591], 10],
# [[11486, 12624], 9], [[12624, 14589], 7]]
This shows that the pair of ids, 14588
and 14590
(from arr[2]
and arr[4]
) have the highest similarity score, of 14
.