0

My current script:

old_ids = []
File.open("missing.txt","w") do |result|
  File.open('query_result.csv', "r") do |old_copy| 
    old_copy.each_line do |line|
      old_ids << line.chomp.to_i #add exisiting ids to array
    end
  end

  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i #find all ids in new file that do not exist in old file and add them to missing.txt
         result.puts line.split("\t")[0].chomp.to_i
         puts "Adding #{line.split("\t")[0].chomp.to_i}"
          end
        end
      end
    end

How can refactor this to be more efficient. The files I parse contain ~2.6 million records, so as written, this take a really long time to execute.

Yogzzz
  • 2,735
  • 5
  • 36
  • 56
  • This SO question could interest you http://stackoverflow.com/questions/3128374/what-is-the-best-diff-library-in-ruby – toch Mar 07 '13 at 08:39

3 Answers3

2

Lowest hanging fruit? Replace your use of an Array and Array#include with a Hash and Hash#key?:

old_ids = {}
File.open("missing.txt","w") do |result|
  File.foreach('query_result.csv') do |line|
    id = line.chomp.to_i
    old_ids[id] = id
  end

  File.foreach('item_info') do |line|
    id = line.split("\t")[0].chomp.to_i
    unless old_ids.key?(id)
      result.puts id
      puts "Adding #{id}"
    end
  end
end

The reason is simple: Array#include? scans the entire array every time looking for the given value, giving it O(n2) complexity overall. Hash#key?, on the other hand, computes the hash for the given value then performs a lookup to see if the given key is present in the hash table. Amortized time complexity approaches O(n).

A simple test case between the two (finding common lines between two files) yields:

$ time ruby include.rb
real    2m51.409s
user    2m51.246s
sys     0m0.138s

versus

$ time ruby key.rb
real    0m0.092s
user    0m0.082s
sys     0m0.009s

That's with two files of 216 lines each. At 10,000 lines, include still takes 5 seconds, while key? takes all of 29ms.

With 2 million lines, key? takes under 4 seconds. I don't think I can wait around for the Array#include? implementation to finish.

Alistair A. Israel
  • 6,417
  • 1
  • 31
  • 40
1

As you want to detect the additional ids, you are not so far from an optimal solution. You can speed up the lookup by building a Set rather than putting your old ids in an array. It's faster.

old_ids = Set.new
File.open("missing.txt","w") do |result|
  File.open('query_result.csv', "r") do |old_copy| 
    old_copy.each_line do |line|
      old_ids.add(line.chomp.to_i) #add exisiting ids to Set
    end
  end

  File.open('item_info', "r") do |new_copy|
    new_copy.each_line do |line|
      if !old_ids.include? line.split("\t")[0].chomp.to_i #test if the id exists in the Set old_ids
         result.puts line.split("\t")[0].chomp.to_i
         puts "Adding #{line.split("\t")[0].chomp.to_i}"
        end
      end
    end
  end
end

I can't test without an example of files. There was one error in your code (old_wmt_ids).

toch
  • 3,905
  • 2
  • 25
  • 34
  • Thanks for reminding me that Ruby has a `Set` class. Was going to edit my answer to replace `Hash` and `Hash#key?` with `Set` and `Set#include?` but then I looked at the source and (as I suspected) `Set` just uses a `@hash` under the covers so I've left my answer as is (but you got my upvote). For the OP's case, `Set` is probably a better choice as it expresses the intent more clearly. – Alistair A. Israel Mar 07 '13 at 09:17
  • @AlistairIsrael You're welcome. I'm looking for information about the implementation of Set and Hash. Without that information it's impossible to tell the exact time complexity of the operations and make a good choice. If you have any info about that, I'm very interested. – toch Mar 07 '13 at 09:23
  • I just based it on the source for [`Set#include?`](http://www.ruby-doc.org/stdlib-2.0/libdoc/set/rdoc/Set.html#method-i-include-3F) and [`Hash#key?`](http://ruby-doc.org/core-2.0/Hash.html#method-i-include-3F). If you follow the URLs to the Ruby docs, you can "click to toggle source". `Set#include?` uses an internal `@hash` as already mentioned, while `Hash#key?` is implemented in C—at least, with MRI. – Alistair A. Israel Mar 07 '13 at 09:28
  • You're right (complete [source of Set](https://github.com/ruby/ruby/blob/trunk/lib/set.rb)) source is totally based on Hash. That means one layer of indirection. Probably better to work with Hash. – toch Mar 07 '13 at 09:38
0

Have a look at Diffy, this is a cool gem for calculating diffs between files.

Intrepidd
  • 19,772
  • 6
  • 55
  • 63