2

This is a very simple question; which items appear in the list more than once?

array = ["mike", "mike", "mike", "john", "john", "peter", "clark"]

The correct answer is ["mike", "john"].

Seems like we can just do:

array.select{ |e| ary.count(e) > 1 }.uniq

Problems solved. But wait! What if the array is REALLY big:

1_000_000.times { array.concat("1234567890abcdefghijklmnopqrstuvwxyz".split('')) }

It just so happens I need to figure out how to do this in a reasonable amount of time. We're talking millions and millions of records.

For what it's worth, this massive array is actually a sum of 10-20 smaller arrays. If it's easier to compare those, let me know - I'm stumped.

We're talking 10,000 to 10,000,000 lines per file, hundreds of files.

dsp_099
  • 5,801
  • 17
  • 72
  • 128
  • 1
    Bit of a brainstorm, but what if we fed those values into a hash table, then if they start conflicting, we assume that it's a duplicate? – S.V. Aug 19 '16 at 11:19
  • Brainstorm continued: The assumed 32-bit hashes of 1,000,000,000 entries would require a table of up to four gigabyte. This could be boiled down to 116 megabyte by using a bitset where each bit represents an occupied hash. This approach would require two passes through the data to filter out potential duplicates. – Axel Kemper Aug 19 '16 at 11:35
  • What is an acceptable runtime for your use case? Not sure I understand the comparison of smaller arrays note: This would only return the same result if you are sure that each array contains items that do not appear in the other arrays. – Pascal Aug 19 '16 at 11:59
  • I can't get your avatar out of my mind. – Cary Swoveland Aug 21 '16 at 07:25
  • @CarySwoveland Hey Cary, I emailed you a while back for Ruby help. I imagine my avatar sings as a tenor. What do you think? – dsp_099 Aug 21 '16 at 09:10
  • [Hmmm](http://www.findingdulcinea.com/docroot/dulcinea/fd_images/features/profiles/p/luciano-pavarotti/features/0/image.jpg). – Cary Swoveland Aug 21 '16 at 15:01

2 Answers2

2

Does something like

items = 30_000_000

array = items.times.map do
  rand(10_000_000)
end

puts "Done with seeding"
puts
puts "Checking what items appear more than once. Size: #{array.size}"
puts

t1 = Time.now
def more_than_once(array)
  counts = Hash.new(0)
  array.each do |item|
    counts[item] += 1
  end

  counts.select do |_, count|
    count > 1
  end.keys
end

res = more_than_once(array)
t2 = Time.now


p res.size
puts "Took #{t2 - t1}"

work for you?

The duration is about 40s on my machine.

Pascal
  • 8,464
  • 1
  • 20
  • 31
1

Here are two more solutions with a benchmark comparison of these and @Pascal's methods.

Use sets

require 'set'

def multi_set(arr)
  s1 = Set.new
  arr.each_with_object(Set.new) { |e, smulti| smulti.add(e) unless s1.add?(e) }.to_a
end

arr = ["mike", "mike", "mike", "john", "john", "peter", "clark"]    
multi(arr)
  #=> ["mike", "john"]

s1 is being built to include all distinct elements of arr. s1.add?(e) returns nil if s1 already contains e, in which case e is added to smulti if smulti does not already contain that element. (See Set#add?.) smulti is returned by the method.

Use Array#difference

Array#difference is a method I've proposed be added to Ruby's core. See also my answer here.

class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end

def multi_difference(arr)
  arr.difference(arr.uniq).uniq
end

Benchmark

def more_than_once(arr)
  counts = Hash.new { |hash, key| hash[key] = 0 }
  arr.each do |item|
    counts[item] += 1
  end
  counts.select do |_, count|
    count > 1
  end.keys
end

require 'fruity'

items = 30_000_000
arr = items.times.map { rand 10_000_000 }

compare do 
  Pascal     { more_than_once(arr) }
  Set        { multi_set(arr) }
  Difference { multi_difference(arr) }
end

Running each test once. Test will take about 4 minutes.
Pascal is faster than Set by 19.999999999999996% ± 10.0%
Set is faster than Difference by 30.000000000000004% ± 10.0%

Of course, difference, if part of the Ruby core, would be coded in C and optimized.

Community
  • 1
  • 1
Cary Swoveland
  • 106,649
  • 6
  • 63
  • 100