
I wrote a quick Python script to compare two files, each containing unordered hashes, to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge. Any ideas on what I am doing wrong?

Environment is Windows XP x64, Python 2.6, Ruby 1.8.6.

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if line not in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if line not in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

EDIT: To give an idea of scale, each file is pretty big, at over 900 000 hashes.

PROGRESS

Using nathanvda's suggestions, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.select { |z| z != 0 }.size

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223.

On a Linux VM running Ruby 1.9.1, the original version ran in 81 seconds, about a third of the Windows 1.8.7 time. Interestingly, the optimized version took longer, at 89 seconds. Note that the `while line = ...` construct was necessary due to memory constraints.

On Windows with Ruby 1.9.1, the original took 457 seconds and the optimized version took 543.

On Windows with JRuby, the original took 45 seconds and the optimized version took 43.

I am somewhat surprised by the results; I was expecting 1.9.1 to be an improvement over 1.8.7.

tscho
  • There are many other Ruby VMs available, most if not all of which are faster than Ruby 1.8.6 on windows. IronRuby might be your best bet. See http://antoniocangiano.com/2009/08/03/performance-of-ironruby-ruby-on-windows/ – hgmnz Dec 08 '09 at 00:25
  • Using readlines() with the file will read the entire file into memory and create a huge list. You can just iterate over the file a line at a time, as I have shown in my answer. This may actually be slower than using readlines, but is more memory efficient. – John La Rooy Dec 08 '09 at 00:27
  • The first thing you should do before posting is to simplify the question. Remove the comparison and the second file entirely, and you'll still see the difference, which makes it easier for everyone. From there, it's easy to see that both Python's file reading and dicts are much faster than Ruby's (at least 1.8's). Python is about 2-3x faster for me in this case; the extra code only changes the scale for me, not the factor. – Glenn Maynard Dec 08 '09 at 01:12
  • Unrelated caution: `readlines` and `each_line` return lines including any newline marker at the end. If the last hash in the file isn't followed by a newline terminator, it will come out without `\n` and won't match a previous hash from another line that did have the `\n`. – bobince Dec 08 '09 at 02:31
  • @bobince: and so will variations in the number of trailing spaces on lines etc etc ... depends on how the OP wants to define "same" ... line.rstrip() or line.rstrip('\n') or no strip at all – John Machin Dec 08 '09 at 05:38
  • It makes no difference, but Python's `with open(...) as f: for line in f...` and Ruby's `File.open(...) do |f| f.each_line ...` would be better style, as it saves you from having to close `f` manually, is exception-safe, etc. – ephemient Dec 08 '09 at 05:51
  • @John: yes, indeed. And it's probably not an issue in this case anyway. But this is probably the most common case where the intuitive definition of what a ‘line’ is and what the language does isn't quite the same; I've seen a fair few tools tripped up by assuming the last line has a newline terminator. – bobince Dec 08 '09 at 13:30
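To illustrate bobince's caution above: normalizing each line before counting sidesteps the missing-final-newline problem. A minimal Ruby sketch using chomp (whether to strip only the newline or all trailing whitespace depends on how you define "same", per John Machin):

hashes = Hash.new(0)   # unseen keys default to a count of 0

File.open("c:\\file1.txt", "r") do |f|
  f.each_line do |line|
    hashes[line.chomp] += 1   # chomp drops the trailing newline, if any
  end
end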

8 Answers


It could be because dicts in Python are much faster than hashes in Ruby.

I've just run a quick test: building a hash of 12345678 items in Ruby 1.8.7 took 3 times as long as in Python, and Ruby 1.9 took about twice as long.

Here is how I tested.

Python

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

Not enough to account for your discrepancy though.

Perhaps file I/O is worth looking into - comment out all the hash code and see how long the empty loops take to run over the files.
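For instance, a minimal Ruby sketch of that I/O-only measurement (using the questioner's file path; benchmark is in the standard library):

require 'benchmark'

# Read file1 line by line with all hashing removed, to see how much
# of the total runtime is pure file I/O.
io_only = Benchmark.measure do
  File.open("c:\\file1.txt", "r") do |f|
    while f.gets; end   # read and discard every line
  end
end

puts io_only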

Here's another version in Python using defaultdict and context managers

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key, value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors
John La Rooy
  • I believe you mean `for key, value in hashes.items()` (or `hashes.iteritems()`) rather than `hashes.keys()`. – Chris Lutz Dec 08 '09 at 00:41
    In his case, he is hashing strings. Changing your Python test to use d[str(i)]=1 gives me a fourfold increase in time. I don't know enough Ruby to change the Ruby example and compare. – Mark Peters Dec 08 '09 at 01:13
  • @Mark, it would be `i.to_s` for ruby – John La Rooy Dec 08 '09 at 01:36
  • @gnibbler, my times: Python (int): 0m4.359s Python (str): 0m16.202s Ruby (int): 0m8.092s Ruby (str): 0m36.648s Seems like they scale about the same. – Mark Peters Dec 08 '09 at 02:13
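Combining those two comments, the string-key variants of the micro-benchmark would be:

$ time python -c "d={}
for i in xrange(12345678):d[str(i)]=1"

$ time ruby -e "d={};12345678.times{|i|d[i.to_s]=1}"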

I've found Ruby's reference implementation (well, Ruby) to be, unscientifically speaking, dog slow.

If you have the opportunity, please try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.

I for one would be most interested in your results.

Carl Smotricz
  • Ruby 1.8.6 in particular was not very fast. And the version of Python in question is quite a bit newer. Ruby 1.9 would surely be better. – Chuck Dec 08 '09 at 00:20
  • I tried JRuby, and while it was MUCH faster than the 1.8.6 interpreter, at ~45 seconds it was still 3x slower than Python (~15s). – tscho Dec 08 '09 at 00:28
  • Interesting. Thank you VERY much! – Carl Smotricz Dec 08 '09 at 00:29

On the Python side, you could iterate over the dictionary items like this:

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Also:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

... but I can't help you with the Ruby side, other than to suggest you hunt down a profiler.

retracile

I'm not a Ruby expert, so someone please correct me if I'm wrong:

I see a small optimization potential.

If you say

hashes = Hash.new(0)

then looking up a missing key will return 0 (the key itself is only stored when you assign to it), and you can do

hashes[line] += 1

every time, without the enclosing if and else.

Caveat: Untested!

If storing the key doesn't happen automatically, there's yet another hash constructor using a block where you can do it explicitly.
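For reference, here is how the two constructors behave: a plain read with Hash.new(0) returns the default without storing the key, while the block form can store it explicitly.

counts = Hash.new(0)      # missing keys read as 0
counts["a"] += 1          # the assignment stores "a" => 1
counts["b"]               # => 0, but the read does NOT store "b"
counts.key?("b")          # => false

stored = Hash.new { |h, k| h[k] = 0 }  # the block assigns the default
stored["b"]               # => 0, and "b" IS now stored
stored.key?("b")          # => true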

Carl Smotricz

Python's dictionary is brutally fast. See "How are Python's Built In Dictionaries Implemented?". Perhaps Ruby's is not so crash hot.

I doubt it's the hash functions. There's no way the Ruby devs would have a hash function that's an order of magnitude worse than Python's.

Perhaps Ruby 1.8 is slow at dynamically resizing large hash tables? How does your problem scale with smaller files?
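A rough way to check that resizing hypothesis (a sketch; the sizes are arbitrary):

require 'benchmark'

# Build string-keyed hashes of increasing size and watch how the time
# grows; clearly superlinear growth would point at resize costs.
[10_000, 100_000, 1_000_000].each do |n|
  t = Benchmark.realtime do
    h = {}
    n.times { |i| h[i.to_s] = 1 }
  end
  puts "#{n} keys: #{t} s"
end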

wisty
  • In terms of scale, the difference between inputs of 1 000 and 10 000 is minimal. The difference between 10 000 and 100 000 is nearly linear. The difference between 100 000 to near 1 000 000 is massive. This suggests to me that it is scaling exponentially. – tscho Dec 08 '09 at 17:56
  • @t_scho: Could the latter also suggest that it's hitting some sort of limit (such as RAM)? – Andrew Grimm Mar 07 '11 at 02:19

I was able to speed up your Ruby code a bit, as follows:

require 'benchmark'

Benchmark.bm(10) do |x|

  x.report("original version") do
    file = File.open("c:\\file1.txt", "r")
    hashes = {}

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] += 1
      else
        hashes[line] = 1
      end
    }

    file.close()

    #puts "Done file 1"

    not_founds = 0

    file = File.open("c:\\file2.txt", "r")

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] -= 1
      else
        not_founds += 1        
      end
    }

    file.close()

    #puts "Done file 2"

    num_errors = 0
    hashes.each_key{ |key|
      if hashes[key] != 0
        num_errors += 1
      end
    }

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end


  x.report("my speedup") do
    hashes = {}
    File.open("c:\\file1.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] += 1
        else
          hashes[line] = 1
        end
      }
    end  

    not_founds = 0

    File.open("c:\\file2.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] -= 1
        else
          not_founds += 1
        end
      }
    end

    num_errors = hashes.values.select { |z| z != 0 }.size

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end

end

So I read the files in one big chunk, which in my case is a bit faster (I tested on Windows XP, Ruby 1.8.6 and a file of 100 000 lines). I benchmarked all the different ways to read files I could think of, and this was the fastest. I also sped up the counting of the values in the hash a bit, but that is only visible for very large numbers :)

So I get a very small speed increase here. The output on my machine is as follows:

                user     system      total        real
original versionTotal of 16 lines not found in file2
Total of 4 mismatches found
   1.000000   0.015000   1.015000 (  1.016000)
my speedup v1Total of 16 lines not found in file2
Total of 4 mismatches found
   0.812000   0.047000   0.859000 (  0.859000)

Does anyone have ideas to improve this further?

If `f.readlines` gets slower because of the file size, I found that

File.open("c:\\file2.txt", "r") do |f|
  while (line=f.gets)
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

is just a tad quicker for me.

I was thinking about a way to improve the

if hashes.has_key?(line) ...

code a bit, but could not think of anything.
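One possible idea, sketched here but not benchmarked: do a single [] lookup instead of has_key? followed by [], since reading a missing key from a plain {} returns nil, and 0 is truthy in Ruby, so zero counts still take the first branch:

File.open("c:\\file2.txt", "r") do |f|
  while line = f.gets
    count = hashes[line]        # one lookup; nil when the key is absent
    if count
      hashes[line] = count - 1
    else
      not_founds += 1
    end
  end
end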

Have you tried using Ruby 1.9?

I have a Windows 7 virtual machine with Ruby 1.9.1, and there `f.readlines` was slower, and I needed to use `while (line = f.gets)` because of the memory limitations :)

Since a lot of Ruby users test mainly on Unix-related platforms, I guess that could explain why the code is sub-optimal on Windows. Has anybody compared the above performance on Unix? Is this a Ruby vs. Python problem, or a Ruby-on-Windows vs. Ruby-on-Unix problem?

nathanvda

I'd bet the results for Ruby 1.9.x, which is otherwise faster than or on par with Python in most areas, are caused by the extra overhead its hash implementation requires: hashes preserve insertion order in Ruby 1.9, unlike Python's dicts.

DavidUnc

I'll try to do a benchmark in my copious free time, but try using group_by. It's not only more like functional programming, but I've found it to be a lot faster than the procedural version in MRI.

def convert_to_hash(file)
  values_hash = file.each_line.group_by {|line| line}
  # Hash.[] converts an array of pairs into a hash
  count_hash = Hash[ values_hash.map{|line, lines| [line, lines.length]}]
  count_hash
end

hash1 = File.open("c:\\file1.txt") { |f| convert_to_hash(f) }
hash2 = File.open("c:\\file2.txt") { |f| convert_to_hash(f) }
puts hash1 == hash2   # true iff the files contain the same lines, ignoring order
Andrew Grimm