How to find which items in a MASSIVE array appear more than once?

Question

This is a very simple question; which items appear in the list more than once?

array = ["mike", "mike", "mike", "john", "john", "peter", "clark"]

The correct answer is ["mike", "john"].

Seems like we can just do:

array.select{ |e| ary.count(e) > 1 }.uniq

Problems solved. But wait! What if the array is REALLY big:

1_000_000.times { array.concat("1234567890abcdefghijklmnopqrstuvwxyz".split('')) }

It just so happens I need to figure out how to do this in a reasonable amount of time. We're talking millions and millions of records.

For what it's worth, this massive array is actually a sum of 10-20 smaller arrays. If it's easier to compare those, let me know - I'm stumped.

We're talking 10,000 to 10,000,000 lines per file, hundreds of files.

Bit of a brainstorm, but what if we fed those values into a hash table, then if they start conflicting, we assume that it's a duplicate? — S.V., Aug 19 '16 at 11:19
Brainstorm continued: The assumed 32-bit hashes of 1,000,000,000 entries would require a table of up to four gigabyte. This could be boiled down to 116 megabyte by using a bitset where each bit represents an occupied hash. This approach would require two passes through the data to filter out potential duplicates. — Axel Kemper, Aug 19 '16 at 11:35
What is an acceptable runtime for your use case? Not sure I understand the comparison of smaller arrays note: This would only return the same result if you are sure that each array contains items that do not appear in the other arrays. — Pascal, Aug 19 '16 at 11:59
@CarySwoveland Hey Cary, I emailed you a while back for Ruby help. I imagine my avatar sings as a tenor. What do you think? — dsp_099, Aug 21 '16 at 09:10
[Hmmm](http://www.findingdulcinea.com/docroot/dulcinea/fd_images/features/profiles/p/luciano-pavarotti/features/0/image.jpg). — Cary Swoveland, Aug 21 '16 at 15:01

Pascal · Accepted Answer · 2016-08-21T09:03:57.830

2

Does something like

items = 30_000_000

array = items.times.map do
  rand(10_000_000)
end

puts "Done with seeding"
puts
puts "Checking what items appear more than once. Size: #{array.size}"
puts

t1 = Time.now
def more_than_once(array)
  counts = Hash.new(0)
  array.each do |item|
    counts[item] += 1
  end

  counts.select do |_, count|
    count > 1
  end.keys
end

res = more_than_once(array)
t2 = Time.now


p res.size
puts "Took #{t2 - t1}"

work for you?

The duration is about 40s on my machine.

edited Aug 21 '16 at 09:03

answered Aug 19 '16 at 11:56

Pascal

8,464
1
20
31

Love the default hash initialization ``counts = Hash.new { |hash, key| hash[key] = 0 }`` – zhon Aug 19 '16 at 15:45
You can simplify to `counts = Hash.new(0)`. – Cary Swoveland Aug 19 '16 at 22:49
And the counting could be made a one liner using each_with_object: counts = array.each_with_object(Hash.new) { |item, akku| akku[item] += 1 } – Pascal Aug 21 '16 at 09:36
You forgot `(0)` in `Hash.new(0)` in your last comment. – Cary Swoveland Aug 21 '16 at 19:21
Arghh. Thanks, again. – Pascal Aug 22 '16 at 07:34
This works, except for when the size of the data is higher than my RAM at which point this solution quits working :( – dsp_099 Aug 26 '16 at 10:41
In that case you either need to split the task or use some sort of external storage to store intermediate results. – Pascal Aug 26 '16 at 12:13

score 1 · Answer 2 · edited May 23 '17 at 12:33

Here are two more solutions with a benchmark comparison of these and @Pascal's methods.

Use sets

require 'set'

def multi_set(arr)
  s1 = Set.new
  arr.each_with_object(Set.new) { |e, smulti| smulti.add(e) unless s1.add?(e) }.to_a
end

arr = ["mike", "mike", "mike", "john", "john", "peter", "clark"]    
multi(arr)
  #=> ["mike", "john"]

s1 is being built to include all distinct elements of arr. s1.add?(e) returns nil if s1 already contains e, in which case e is added to smulti if smulti does not already contain that element. (See Set#add?.) smulti is returned by the method.

Use Array#difference

Array#difference is a method I've proposed be added to Ruby's core. See also my answer here.

class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end

def multi_difference(arr)
  arr.difference(arr.uniq).uniq
end

Benchmark

def more_than_once(arr)
  counts = Hash.new { |hash, key| hash[key] = 0 }
  arr.each do |item|
    counts[item] += 1
  end
  counts.select do |_, count|
    count > 1
  end.keys
end

require 'fruity'

items = 30_000_000
arr = items.times.map { rand 10_000_000 }

compare do 
  Pascal     { more_than_once(arr) }
  Set        { multi_set(arr) }
  Difference { multi_difference(arr) }
end

Running each test once. Test will take about 4 minutes.
Pascal is faster than Set by 19.999999999999996% ± 10.0%
Set is faster than Difference by 30.000000000000004% ± 10.0%

Of course, difference, if part of the Ruby core, would be coded in C and optimized.

How to find which items in a MASSIVE array appear more than once?

2 Answers2

Linked