I have an array with 35k elements. How may I efficiently find duplicates and return those duplicates?
all = [*1..35000, 1]
This solution works:
all.select { |v| all.count(v) > 1 }
But it takes a long time.
I have an array with 35k elements. How may I efficiently find duplicates and return those duplicates?
all = [*1..35000, 1]
This solution works:
all.select { |v| all.count(v) > 1 }
But it takes a long time.
Your code is taking an eon to execute because it is executing count
for each element, resulting in it having a computational complexity of O(n2).
arr = [*1..35000, 1, 34999]
If you want to know which values appear in the array at least twice...
require 'set'
uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
#=> [1, 34999]
Set lookups (implemented with a hash under the covers) are extremely fast.
If you want to know the numbers of times values appear in the array that appear at least twice...
arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
#=> {1=>2, 34999=>2}
This uses a counting hash1. See Hash::new when it takes a default value as an argument.
If you want to know the indices of values that appear in the array at least twice...
arr.each_with_index.
with_object({}) { |(x,i),h| (h[x] ||= []) << i }.
select { |_,v| v.size > 1 }
#=> {1=>[0, 35000], 34999=>[34998, 35001]}
When the hash h
does not already have a key x
,
(h[x] ||= []) << i
#=> (h[x] = h[x] || []) << i
#=> (h[x] = nil || []) << i
#=> (h[x] = []) << i
#=> [] << i where [] is now h[x]
1. Ruby v2.7 gave us the method Enumerable#tally, allowing us to write arr.tally.select { |_,v| v > 1 }
.
If you're on Ruby 2.4.0+, you can use group_by
+ Hash#transform_values
(available Ruby 2.4.0):
all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
See it in action:
all.group_by(&:itself)
groups how many times an element appears:
{
1 => [1, 1],
2 => [2],
...
35000 => [35000]
}
Then let's turn the values of above hash into frequency:
all.group_by(&:itself).transform_values(&:size)
:
{
1 => 2,
2 => 1,
...
35000 => 1
}
Benchmark:
def measure_execution_time
started = Time.now
yield
Time.now - started
end
measure_execution_time do
all.select { |value| all.count(value) > 1 }
end
=> 7.235489
measure_execution_time do
all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
end
=> 0.017887
Cary’s solution seems to be the fastest here so far. Here is mine:
large_array.sort.each_cons(2).with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
NB: unlike Cary’s solution, and like juanitofatas’s one, this might be easily adapted to find those having more than N
occurrences (just change the parameter to each_cons
.
Also, if the original array might be mutated, this consumes the least amount of memory amongst others (sort
→ sort!
.)
Benchmarks:
require 'set'
require 'benchmark'
def mudsie arr
arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
end
def cary arr
uniq_set = Set.new
arr.each_with_object(Set.new) do |x,dup_set|
uniq_set.add?(x) || dup_set.add(x)
end.to_a
end
def juanitofatas arr
arr.group_by(&:itself).transform_values(&:size).select do |_, freq|
freq > 1
end.keys
end
arr = [*(1..35000)]
arr = (arr + arr.sample(500)).shuffle
n = 500
Benchmark.bm do |x|
x.report("juanitofatas") { n.times { juanitofatas arr } }
x.report("cary") { n.times { cary arr } }
x.report("mudsie") { n.times { mudsie arr } }
end
user system total real
juanitofatas 4.321030 0.000000 4.321030 ( 4.321232)
cary 3.229409 0.032003 3.261412 ( 3.261995)
mudsie 3.798093 0.000000 3.798093 ( 3.798141)
Can't you just count them yourself?
counter = Hash.new(0)
array.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys
# OR
array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
.select! { |k, v| v > 1 }
.keys
It has O(n) complexity and I don't think you can go faster than that (I mean faster than O(n))
Great replies!
My version inspired by Cary Swoveland but more verbose, just one line less than Nondv with select
instead of select!
, which seems even faster:
def igian(arr)
h = Hash.new(0)
arr.each { |a| h[a] += 1 }
h.select { |_k, v| v > 1 }
end
Benchmark on all replies, inspired by mudasobwa.
n = 500
user system total real
cary1 5.040000 0.200000 5.240000 ( 5.248103)
cary2 4.700000 0.190000 4.890000 ( 4.911883)
juanito 7.430000 0.030000 7.460000 ( 7.483123)
mudsie 5.430000 0.020000 5.450000 ( 5.460839)
nondv1 4.720000 0.190000 4.910000 ( 4.924792)
nondv2 5.110000 0.190000 5.300000 ( 5.317148)
igian 4.310000 0.190000 4.500000 ( 4.522211)
n = 1000
user system total real
cary1 10.460000 0.410000 10.870000 ( 10.900927)
cary2 9.550000 0.410000 9.960000 ( 9.989021)
juanito 15.370000 0.160000 15.530000 ( 15.569288)
mudsie 10.920000 0.020000 10.940000 ( 10.972357)
nondv1 9.590000 0.410000 10.000000 ( 10.017669)
nondv2 10.340000 0.410000 10.750000 ( 10.774538)
igian 8.790000 0.400000 9.190000 ( 9.213292)
Here is the benchmark code:
require 'benchmark'
require 'set'
arr = [*1..35000, 1]
def cary1(arr)
uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
end
def cary2(arr)
arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
end
def juanito(arr)
arr.group_by(&:itself).transform_values(&:size).select { |_,v| v > 1 }
end
def mudsie(arr)
arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
end
def nondv1(arr)
counter = Hash.new(0)
arr.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys
end
def nondv2(arr)
arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
.select! { |k, v| v > 1 }
.keys
end
def igian(arr)
h = Hash.new(0)
arr.each { |a| h[a] += 1 }
h.select { |_k, v| v > 1 }
end
n = 500 #1000
Benchmark.bm do |x|
x.report("cary1") { n.times { cary1 arr } }
x.report("cary2") { n.times { cary2 arr } }
x.report("juanito") { n.times { juanito arr } }
x.report("mudsie") { n.times { mudsie arr } }
x.report("nondv1") { n.times { nondv1 arr } }
x.report("nondv2") { n.times { nondv2 arr } }
x.report("igian") { n.times { igian arr } }
end