3

I'm attempting to compare two arrays of hashes with very similar hash structure (identical and always-present keys) and return the deltas between the two--specifically, I'd like to capture the following:

  • Hashes part of array1 that do not exist in array2
  • Hashes part of array2 that do not exist in array1
  • Hashes which appear in both data sets

This typically can be achieved by simply doing the following:

deltas_old_new = (array1-array2)
deltas_new_old = (array2-array1)

The problem for me (which has turned into a 2-3 hour struggle!) is that I need to identify the deltas based on the values of 3 keys within the hash ('id', 'ref', 'name')--the values of these 3 keys are effectively what makes up a unique entry in my data -- but I must retain the other key/value pairs of the hash (e.g. 'extra' and numerous other key/value pairs not shown for brevity).

Example Data:

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

Expected Outcome (3 separate array of hashes):

Object containing data in array1 but not in array2 --

[{'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
 {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

Object containing data in array2 but not in array1 --

[{'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
 {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
 {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

Object containing data in BOTH array1 and array2 --

[{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
 {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'}]

I've made numerous attempts at iterating over the arrays and comparing them using Hash#keep_if based on the 3 keys, as well as merging both data sets into a single array and then attempting to de-dup based on array1, but I keep coming up empty-handed. Thank you in advance for your time and assistance!

Kurt W

3 Answers

1

For this type of problem it's generally easiest to work with indices.

Code

def keepers(array1, array2, keys)
  a1 = make_hash(array1, keys)
  a2 = make_hash(array2, keys)
  common_keys_of_a1_and_a2 = a1.keys & a2.keys
  [keeper_idx(array1, a1, common_keys_of_a1_and_a2),
   keeper_idx(array2, a2, common_keys_of_a1_and_a2)]
end

def make_hash(arr, keys)
  arr.each_with_index.with_object({}) do |(g,i),h|
    (h[g.values_at(*keys)] ||= []) << i
  end
end

def keeper_idx(arr, a, common_keys_of_a1_and_a2)
  arr.size.times.to_a - a.values_at(*common_keys_of_a1_and_a2).flatten
end

Example

array1 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 8'},
   {'id' =>  '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 =
  [{'id' =>  '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
   {'id' =>  '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
   {'id' =>  '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
   {'id' =>  '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 12'},
   {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

Notice that the two arrays are slightly different from those given in the question. The question did not specify whether each array could contain two hashes that have the same values for the specified keys. I therefore added a hash to each array to show how that case is dealt with.

keys = ['id', 'ref', 'name']

idx1, idx2 = keepers(array1, array2, keys)
  #=> [[1, 4], [2, 3, 4, 5]]

idx1 (idx2) are the indices of the elements of array1 (array2) that remain after matches are removed. (array1 and array2 are not modified, however.)

It follows that the two arrays map to

array1.values_at(*idx1)
  #=> [{"id"=> "2", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"},
  #    {"id"=> "7", "ref"=>"1007", "name"=>"OR", "extra"=>"Not Sorted On 11"}]

and

array2.values_at(*idx2)
  #=> [{"id"=> "8", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"},
  #    {"id"=> "5", "ref"=>"1005", "name"=>"MT", "extra"=>"Not Sorted On 10"},
  #    {"id"=> "5", "ref"=>"1005", "name"=>"MT", "extra"=>"Not Sorted On 12"},
  #    {"id"=>"12", "ref"=>"1012", "name"=>"TX", "extra"=>"Not Sorted On 85"}]

The indices of the hashes that are removed are given as follows.

array1.size.times.to_a - idx1
  #=> [0, 2, 3]
array2.size.times.to_a - idx2
  #=> [0, 1]
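The removed indices can be passed back to `values_at` to recover the complete hashes that the two arrays have in common, with every key/value pair intact. The following is a minimal, self-contained sketch of that idea on shortened data. `key_tuple_index` is a hypothetical helper name that plays the same role as `make_hash` above, and `shared`, `common_idx` and `common` are illustrative names, not part of the original answer.

```ruby
# Hypothetical helper: maps each tuple of key values to the indices where it occurs.
def key_tuple_index(arr, keys)
  arr.each_with_index.with_object({}) do |(h, i), acc|
    (acc[h.values_at(*keys)] ||= []) << i
  end
end

keys = ['id', 'ref', 'name']

array1 = [
  { 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5' },
  { 'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7' },
  { 'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9' }
]
array2 = [
  { 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5' },
  { 'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7' }
]

index1 = key_tuple_index(array1, keys)
index2 = key_tuple_index(array2, keys)

# Key tuples that occur in both arrays.
shared = index1.keys & index2.keys

# Indices of array1 whose key tuple also occurs in array2.
common_idx = index1.values_at(*shared).flatten.sort

# values_at returns the complete hashes, including 'extra'.
common = array1.values_at(*common_idx)
```

Here only the 'CA' entry is shared, so `common` holds that one hash from `array1` in full.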

Explanation

The steps are as follows.

a1 = make_hash(array1, keys)
  #=> {["1", "1001", "CA"]=>[0], ["2", "1002", "NY"]=>[1],
  #    ["3", "1003", "WA"]=>[2, 3], ["7", "1007", "OR"]=>[4]}    
a2 = make_hash(array2, keys)
  #=> {["1", "1001", "CA"]=>[0], ["3", "1003", "WA"]=>[1],
  #    ["8", "1002", "NY"]=>[2], ["5", "1005", "MT"]=>[3, 4],
  #    ["12", "1012", "TX"]=>[5]}
common_keys_of_a1_and_a2 = a1.keys & a2.keys
  #=> [["1", "1001", "CA"], ["3", "1003", "WA"]]
keeper_idx(array1, a1, common_keys_of_a1_and_a2)
  #=> [1, 4] (for array1)
keeper_idx(array2, a2, common_keys_of_a1_and_a2)
  #=> [2, 3, 4, 5] (for array2)
Cary Swoveland
  • Kurt, I concluded my answer to the [question where you left your request](https://stackoverflow.com/questions/30494890/ruby-compare-two-arrays-of-hash-with-certain-keys/30495256?noredirect=1#comment78845124_30495256) is not the best for this problem, so I offered a different solution. – Cary Swoveland Aug 29 '17 at 23:06
  • Cary, you've once again outdone yourself! Thank you so very much for the detailed and explicit answer!! I so appreciate you taking the time and thought to help make it clear the flow of what is happening. Just a note that it might be better to move `a1`, `a2`, and `common_keys` into your main code answer; since it was in the explanation block it initially threw me for a loop as they were undefined. Thank you so very much once again! A million thanks! I'm not sure what the S.O. etiquette is for this--this answer more directly explains prob/solution; should I accept it this late in the game? – Kurt W Aug 30 '17 at 17:02
  • Also Cary, it seems that `array1.values_at(*array1.size.times.to_a - idx1).map { |h| h.select { |k,_| keys.include?(k) } }.uniq` produces the common values (seen in both `array1` and `array2`) but it only produces hashes for the named keys (`['id', 'ref', 'name']`). Shouldn't this show the entire hash with all of the key/value pairs--despite only calculating on those 3? Thanks again! – Kurt W Aug 30 '17 at 23:09
  • Thank you for both comments. Regarding #2, I removed that code, which was confusing and probably not of interest. I also changed the name of `common_keys`, which does not refer to the keys of the individual elements of `array1` and `array2`, but of `a1` and `a2`. In comment #1, please clarify what you mean by "move `a1`, `a2` and `common_keys` into your main code answer." They are in `keepers`, which is the main method. – Cary Swoveland Aug 31 '17 at 03:30
0

See Array#- and Array#&

array1 - array2   #data in array1 but not in array2
array2 - array1   #data in array2 but not in array1
array1 & array2   #data in both array1 and array2

Since you've tagged this question you can use sets similarly:

require 'set'

set1 = array1.to_set
set2 = array2.to_set

set1 - set2
set2 - set1
set1 & set2
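Note that `Array#-` and `Array#&` (and the `Set` equivalents) compare entire hashes, so two entries that agree on 'id', 'ref' and 'name' but differ in any other value are treated as different. The sketch below illustrates this with made-up data and shows one possible key-restricted alternative. The names `seen_in_b` and `keyed_diff` and the 'extra' values 'x' and 'y' are assumptions for illustration only.

```ruby
require 'set'

keys = ['id', 'ref', 'name']

a = [{ 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'x' }]
b = [{ 'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'y' }]

# Whole-hash comparison: 'extra' differs, so nothing is removed from a.
whole_hash_diff = a - b

# Key-restricted difference: compare only the values of the chosen keys.
seen_in_b  = b.map { |h| h.values_at(*keys) }.to_set
keyed_diff = a.reject { |h| seen_in_b.include?(h.values_at(*keys)) }
```

With this data `whole_hash_diff` still contains the entry from `a`, while `keyed_diff` is empty because the three key values match.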
Sagar Pandya
  • This will not work for my data because I need to dedup based on specific hash keys present in both arrays ('id', 'ref', 'name'). I don't want to de-dup on ALL of the key/value pairs as it will lead to too many false positives with my data. Can you amend or withdraw your answer so this continues to remain visible? Thanks in advance! – Kurt W Jul 26 '17 at 20:47
  • 1
    @KurtW I'll keep this answer up for a while, if only to serve as a warning to others to not post the same thing. Sorry I misunderstood your question. – Sagar Pandya Jul 26 '17 at 20:52
  • @Cary Swoveland, is your answer at https://stackoverflow.com/questions/30494890/ruby-compare-two-arrays-of-hash-with-certain-keys a perfect match for my question? Thank you again for all your help in recent months! – Kurt W Jul 26 '17 at 22:15
0

This isn't very pretty, but it works. It creates a third array containing all unique values in both array1 and array2 and iterates through that.

Then, since include? doesn't allow a custom matcher, we can fake it by using detect and looking for an item in the array which has the custom sub-hash matching. We'll wrap that in a custom method so we can just call it passing in array1 or array2 instead of writing it twice.

Finally, we loop through our array3 and determine whether the item came from array1, array2, or both of them and add to the corresponding output array.

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

# combine the arrays into 1 array that contains items in both array1 and array2 to loop through
array3 = (array1 + array2).uniq { |item| { 'id' => item['id'], 'ref' => item['ref'], 'name' => item['name'] } }

# Array#include? doesn't allow a custom matcher, so we can fake it by using Array#detect
def is_included_in(array, object)
  object_identifier = { 'id' => object['id'], 'ref' => object['ref'], 'name' => object['name'] }

  array.detect do |item|
    { 'id' => item['id'], 'ref' => item['ref'], 'name' => item['name'] } == object_identifier
  end
end

# output array initializing
array1_only = []
array2_only = []
array1_and_array2 = []

# loop through all items in both array1 and array2 and check if it was in array1 or array2
# if it was in both, add to array1_and_array2, otherwise, add it to the output array that
# corresponds to the input array
array3.each do |item|
  in_array1 = is_included_in(array1, item)
  in_array2 = is_included_in(array2, item)

  if in_array1 && in_array2
    array1_and_array2.push item
  elsif in_array1
    array1_only.push item
  else
    array2_only.push item
  end
end


puts array1_only.inspect        # => [{"id"=>"2", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"}, {"id"=>"7", "ref"=>"1007", "name"=>"OR", "extra"=>"Not Sorted On 11"}]
puts array2_only.inspect        # => [{"id"=>"8", "ref"=>"1002", "name"=>"NY", "extra"=>"Not Sorted On 7"}, {"id"=>"5", "ref"=>"1005", "name"=>"MT", "extra"=>"Not Sorted On 10"}, {"id"=>"12", "ref"=>"1012", "name"=>"TX", "extra"=>"Not Sorted On 85"}]
puts array1_and_array2.inspect  # => [{"id"=>"1", "ref"=>"1001", "name"=>"CA", "extra"=>"Not Sorted On 5"}, {"id"=>"3", "ref"=>"1003", "name"=>"WA", "extra"=>"Not Sorted On 9"}]
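The same three-way split can be written more compactly by comparing `values_at` tuples instead of building sub-hashes. What follows is only a sketch of that variant, not the code above: `split_by_keys` is a hypothetical method name, and it uses `Set` lookups and `Enumerable#partition` in place of `detect`.

```ruby
require 'set'

# Hypothetical helper: returns [only_in_first, only_in_second, in_both].
# in_both holds the matching hashes as they appear in the first array.
def split_by_keys(first, second, keys)
  first_ids  = first.map  { |h| h.values_at(*keys) }.to_set
  second_ids = second.map { |h| h.values_at(*keys) }.to_set

  in_both, only_first = first.partition { |h| second_ids.include?(h.values_at(*keys)) }
  only_second = second.reject { |h| first_ids.include?(h.values_at(*keys)) }

  [only_first, only_second, in_both]
end

array1 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '2', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '7', 'ref' => '1007', 'name' => 'OR', 'extra' => 'Not Sorted On 11'}]

array2 = [{'id' => '1', 'ref' => '1001', 'name' => 'CA', 'extra' => 'Not Sorted On 5'},
          {'id' => '3', 'ref' => '1003', 'name' => 'WA', 'extra' => 'Not Sorted On 9'},
          {'id' => '8', 'ref' => '1002', 'name' => 'NY', 'extra' => 'Not Sorted On 7'},
          {'id' => '5', 'ref' => '1005', 'name' => 'MT', 'extra' => 'Not Sorted On 10'},
          {'id' => '12', 'ref' => '1012', 'name' => 'TX', 'extra' => 'Not Sorted On 85'}]

only1, only2, both = split_by_keys(array1, array2, ['id', 'ref', 'name'])

puts only1.map { |h| h['id'] }.inspect  # ids found only in array1
puts only2.map { |h| h['id'] }.inspect  # ids found only in array2
puts both.map  { |h| h['id'] }.inspect  # ids found in both
```

One difference from the code above: this sketch does not call `uniq`, so if an array contains several hashes with the same three key values, all of them are kept in the output.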
Simple Lime
  • Thank you so very much for taking the time to help out here! It's not as pretty as some things but I haven't seen a better solution for A -> B, B -> A, and the overlap. Curious, does this do what I want https://stackoverflow.com/questions/30494890/ruby-compare-two-arrays-of-hash-with-certain-keys ? Thanks either way, marked as accepted! – Kurt W Jul 29 '17 at 01:24
  • Kurt, `array1[0]` is presently not in the "Object containing data in `array1` but not in `array2`" because there is a hash in `array2` (`array2[0]`) whose values for the first three keys equal the values of the corresponding keys in `array1[0]`. I assume that would still be the case if `array2[0]['extra'] => 'Not Sorted On 99'` rather than "...`Sorted On 5`". If so, yes, I believe the other answer would apply here as well. – Cary Swoveland Aug 01 '17 at 00:57
  • @CarySwoveland, thanks much for getting back to me. Can you clarify that just a bit? Are you asking if `array1`/`array2` both have the `extra` key/value pair and if it should be considered in the calculation? If so, `extra` exists in both sets and has the same value in all cases (e.g. if 'id' => '2', 'ref' => '1002', 'name' => 'NY', then `extra` will always be equal to 'Not Sorted On 7'. Thanks for clarifying. Also, does Simple Lime's answer more closely meet my needs? If so, you seem to be the master of condensing things, maybe you have an improvement suggestion? Thank you again! – Kurt W Aug 01 '17 at 22:16
  • @KurtW Taking a quick look at Cary's other post, I had definitely forgotten about `values_at` which can be used instead of all the `{ 'id' => item['id'], 'ref' => item['ref'], 'name' => item['name'] }` hashes to check for uniqueness as a quick condensing move. I'll take a closer look at the rest of his post and see if there's anything else that might be used here easily to condense some of this – Simple Lime Aug 02 '17 at 03:52