Most efficient way to count duplicated elements between two arrays

Question

As part of a very basic program I am writing in Ruby, I am trying to find the total number of shared elements between two arrays of equal length, but I need to include repeats.

My current example code for this situation is as follows:

array_a = ["B","A","A","A","B"]
array_b = ["A","B","A","B","B"]
counter = 0
array_a.each_index do |i|
  array_a.sort[i] == array_b.sort[i]
    counter += 1
  end
end
puts counter

I want the return value of this comparison in this instance to be 4, and not 2, as the two arrays share 2 duplicate characters ("A" twice, and "B" twice). This seems to work, but I am wondering if there are any more efficient solutions for this issue. Specifically whether there are any methods you would suggest looking into. I spoke with someone who suggested a different method, inject, but I really don't understand how that applies and would like to understand. I did quite a bit of reading on uses for it, and it still isn't clear to me how it is appropriate. Thank you.

Looking at my code, I have realized that it doesn't seem to work for the situation that I am describing.

It appears we will never know if the asker derived any benefit from the answers... — Cary Swoveland, Sep 25 '20 at 20:06

score 1 · Answer 1 · edited May 23 '17 at 12:23

If I understand the question correctly, you could do the following.

Code

def count_shared(arr1, arr2)
  arr1.group_by(&:itself).
       merge(arr2.group_by(&:itself)) { |_,ov,nv| [ov.size, nv.size].min }.
       values.
       reduce(0) { |t,o| (o.is_a? Array) ? t : t + o }
end

Examples

arr1 = ["B","A","A","A","B"]
arr2 = ["A","B","A","B","B"]

count_shared(arr1, arr2)
  #=> 4 (2 A's + 2 B's)

arr1 = ["B", "A", "C", "C", "A", "A", "B", "D", "E", "A"]
arr2 = ["C", "D", "F", "F", "A", "B", "A", "B", "B", "G"]

count_shared(arr1, arr2)
  #=> 6 (2 A's + 2 B's + 1 C + 1 D + 0 E's + 0 F's + 0 G's)

Explanation

The steps are as follows for a slightly modified version of the first example.

arr1 = ["B","A","A","A","B","C","C"]
arr2 = ["A","B","A","B","B","D"]

First apply Enumerable#group_by to both arr1 and arr2:

h0 = arr1.group_by(&:itself)
  #=> {"B"=>["B", "B"], "A"=>["A", "A", "A"], "C"=>["C", "C"]} 
h1 = arr2.group_by(&:itself)
  #=> {"A"=>["A", "A"], "B"=>["B", "B", "B"], "D"=>["D"]}

Prior to Ruby v.2.2, when Object#itself was introduced, you would have to write:

arr.group_by { |e| e }

Continuing,

h2 = h0.merge(h1) { |_,ov,nv| [ov.size, nv.size].min }
  #=> {"B"=>2, "A"=>2, "C"=>["C", "C"], "D"=>["D"]}

I will return shortly to explain the above calculation.

a = h2.values
  #=> [2, 2, ["C", "C"], ["D"]] 
a.reduce(0) { |t,o| (o.is_a? Array) ? t : t + o }
  #=> 4

Here Enumerable#reduce (aka inject) merely sums the values of a that are not arrays. The arrays correspond to elements of arr1 that do not appear in arr2 or vise-versa.

As promised, I will now explain how h2 is computed. I've used the form of Hash#merge that employs a block (here { |k,ov,nv| [ov.size, nv.size].min }) to compute the values of keys that are present in both hashes being merged. For example, when the first key-value pair of h1 ("A"=>["A", "A"]) is being merged into h0, since h0 also has a key "A", the array

["A", ["A", "A", "A"], ["A", "A"]]

is passed to the block and the three block variables are assigned values (using "parallel assignment", which is sometimes called "multiple assignment"):

k, ov, nv = ["A", ["A", "A", "A"], ["A", "A"]]

so we have

k  #=> "A" 
ov #=> ["A", "A", "A"] 
nv #=> ["A", "A"]

k is the key, ov ("old value") is the value of "A" in h0 and nv ("new value") is the value of "A" in h1. The block calculation is

[ov.size, nv.size].min
  #=> [3,2].min = 2

so the value of "A" is now 2.

Notice that the key, k, is not used in the block calculation (which is very common when using this form of merge). For that reason I've changed the block variable from k to _ (a legitimate local variable), both to reduce the chance of introducing a bug and to signal to the reader that the key is not used in the block. The other elements of h2 that use this block are computed similarly.

Another way

It would be quite simple if we had available an Array method I've proposed be added to the Ruby core:

array_a = ["B","A","A","A","B"]
array_b = ["A","B","A","B","B"]

array_a.size - (array_a.difference(array_b)).size
  #=> 4

or

array_a.size - (array_b.difference(array_a)).size
  #=> 4

I've cited other applications in my answer here.

JLB · Answer 2 · 2016-03-25T20:52:55.097

Allow me to reiterate and explain what I think the OP's original intent was:

Given arrays of equal size

array_a = ["B","A","A","A","B"]
array_b = ["A","B","A","B","B"]

We need to show the total number of matching pairs of elements between the two arrays. In other words, each B in array_a will "use up" a B in array_b, and the same will be true for each A. As there are two B's in array_a and three in array_b, this leaves us with a count of 2 for B, and following the same logic, 2 for A, for a sum of 4.

(array_a & array_b).map { |e| [array_a.count(e), array_b.count(e)].min }.reduce(:+)

If we get the intersection of the arrays with &, the result is a list of values that exist in both arrays. We then iterate over each match, and select the minimum number of times the element exists in either array --- this is the most number of times the element that can be "used". All that is left is to total the number of paired elements, with reduce(:+)

Changing array_a to ["B", "A", "A", "B", "B"] results in a total of 5, as there are now enough of B to exhaust the supply of B in array_b.

tadman · Answer 3 · 2016-03-25T00:05:12.393

0

This is a perfect job for Enumerable#zip and Enumerable#count:

array_a.zip(array_b).count do |a, b|
  a == b
end
# => 2

The zip method pairs up elements, "zippering" them together, and the count method can take a block as to if the element should be counted.

The inject method is very powerful, but it's also the most low-level. Pretty much every other Enumerable method can be created with inject if you work at it, so it's quite flexible, but usually a more special-purpose method is better suited. It's still a useful tool if applied correctly.

In this case zip and count do a much better job and if you know what these methods do, this code is self explanatory.

Update:

If you need to count all overlapping letters regardless of order you need to do some grouping on them. Ruby on Rails provides the handy group_by method in ActiveSupport, but in pure Ruby you need to make your own.

Here's an approach that counts up all the unique letters, grouping them using chunk:

# Convert each array into a map like { "A" => 2, "B" => 3 }
# with a default count of 0.
counts = [ array_a, array_b ].collect do |a|
  Hash.new(0).merge(
    Hash[a.sort.chunk { |v| v }.collect { |k, a| [ k, a.length ] }]
  )
end

# Iterate over one of the maps key by key and count the minimum
# overlap between the two.
counts[0].keys.inject(0) do |sum, key|
  sum + [ counts[0][key], counts[1][key] ].min
end

edited Mar 25 '16 at 00:05

answered Mar 24 '16 at 23:33

tadman

208,517
23
234
262

Thank you for your response! The only issue I see with this is that I need to include duplicates in the total shared elements as well, so that the return value of this comparison in this instance would be 4, and not 2, as the two arrays share 2 duplicate characters ("A" twice, and "B" twice). Should've been clearer in my initial post, and edited to clarify that. – 84130555 Mar 24 '16 at 23:45
So AAABB and AABBB match with 4, but ABBCC and AAABB would match with 3? – tadman Mar 24 '16 at 23:51
Yes. It looks like my own current code doesn't work for that situation either. – 84130555 Mar 24 '16 at 23:57
That's a bit uglier, but hopefully the example in my updated answer helps. The first part is kind of a mess but it makes the second part pretty straightforward. – tadman Mar 25 '16 at 00:05
[group_by](http://ruby-doc.org/core-2.1.0/Enumerable.html#method-i-group_by) is a plain Ruby method. – steenslag Mar 25 '16 at 14:37
@steenslag Oh, nice! It got promoted to be one. For some reason I didn't see it in the 2.3 docs that I was looking at, otherwise I would've used that. – tadman Mar 25 '16 at 21:27

Most efficient way to count duplicated elements between two arrays

3 Answers3

Linked