
I have a simple text file that is ~150 MB. My code reads each line and, if it matches certain regexes, writes it to an output file. But right now it takes a long time (several minutes) just to iterate through all of the lines of the file, doing it like this:

File.open(filename).each do |line|
  # do some stuff
end

I know that it is the looping through the lines of the file that is taking a while, because even if I do nothing with the data in `# do some stuff`, it still takes a long time.

I know that some Unix programs can parse large files like this almost instantly (like grep), so I am wondering why Ruby (MRI 1.9) takes so long to read the file. Is there some way to make it faster?
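A minimal way to confirm that the iteration itself is the bottleneck is to time an empty loop with `Benchmark` from the standard library (a sketch, assuming `filename` holds the path):

require 'benchmark'

# Time the bare line iteration with no per-line work.
puts Benchmark.measure {
  File.open(filename).each do |line|
    # intentionally empty
  end
}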

Davis Dimitriov
  • Have you considered using `sed`? – Austin Taylor May 10 '11 at 20:29
  • @Austin I would like to do this in pure ruby – Davis Dimitriov May 10 '11 at 20:47
  • I can't reproduce this. Iterating through a 150mb file takes under a second here. Certainly slower than grep, but not to the extent you're describing. Does the file maybe have very long lines? In that case reading by chunks instead of lines might help (if that's possible at all with what you're trying to do). – sepp2k May 10 '11 at 20:52
  • @sepp2k each line is ~300 characters long, how long were the lines in your test file? – Davis Dimitriov May 10 '11 at 21:07
  • @Henry: In my test each line was 149 characters long followed by a newline (so I had 150 characters per line on one million lines). – sepp2k May 11 '11 at 16:41
  • What are you going to do with each line? That will help suggest a good way to read you file. – Seamus Abshere Jun 13 '11 at 21:05
  • See http://stackoverflow.com/questions/25189262/why-is-slurping-a-file-bad for benchmarks on the fastest ways to load a file. Use `foreach` to read individual lines if you need to look at each one. It's surprisingly fast and results in very simple code when filtering like the OP wants to do. – the Tin Man Jul 31 '15 at 17:11

3 Answers


It's not really fair to compare to grep, because that is a highly tuned utility that only scans the data; it doesn't store any of it. When you're reading that file using Ruby, you end up allocating memory for each line and then releasing it during the garbage collection cycle. grep is a pretty lean and mean regexp-processing machine.

You may find that you can achieve the speed you want by using an external program like grep, called via `system` or through the pipe facility:

`grep ABC bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
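If grep's output could itself be large, the back-tick call above buffers all of it in one string before splitting. A streaming variant is sketched below using Ruby's IO.popen; the command and pattern are the same placeholders as above:

IO.popen("grep ABC bigfile") do |io|
  io.each_line do |line|
    # ... handle each matching line as grep produces it ...
  end
end
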
tadman
  • but what specifically makes Ruby so slow to read the lines of a file compared to grep. Assume Ruby does absolutely no processing on those lines, just reads them and exits. – Davis Dimitriov May 10 '11 at 20:48
  • Ruby has to allocate memory for each line, then destroy that memory, which does involve a lot more work than just scanning a small, sliding buffer as `grep` does. – tadman May 10 '11 at 20:53

File.readlines(filename).each do |line|
  # do stuff with each line
end

This will read the whole file into memory as one array of lines. It should be a lot faster, but it takes more memory.
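If memory is a concern, `File.foreach` (recommended in the comment below) keeps the same line-by-line shape without loading the whole file; a minimal sketch, assuming `filename` holds the path:

File.foreach(filename) do |line|
  # do stuff with each line; only the current line is held in memory
end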

steenslag
  • [Benchmarks show that `readlines` isn't as fast as using `foreach` for large files](http://stackoverflow.com/questions/25189262/why-is-slurping-a-file-bad). It's also not scalable. Use `foreach` instead of `readlines`: the code will remain the same, only it will scale, and it will run faster the bigger the file it reads. – the Tin Man Jul 31 '15 at 17:07

You should read it into memory and then parse it. Of course, it depends on what you are looking for. Don't expect miracle performance from Ruby, especially compared to C/C++ programs that have been optimized over the past 30 years. ;-)
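A minimal sketch of that approach, assuming the goal is to write out every line matching a pattern (the /ABC/ pattern and file names are placeholders; see the comment below about the scalability tradeoff):

data = File.read(filename)           # slurp the whole file into one String
matching = data.scan(/^.*ABC.*$/)    # in Ruby, ^ and $ match at line boundaries
File.open("output.txt", "w") do |out|
  out.puts matching                  # one matching line per output line
end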

Zepplock
  • Your code relies on the Ruby tokenizer to read the file and yield control to you after each line, then read the next line, then yield again, and so on. My suggestion is to read the complete file into memory (let's say a string or char array) and pull the information you need out of that. – Zepplock May 10 '11 at 21:02
  • Looks like you're trying to troll about C/C++ performance; bad try. Looping is just looping, and all the other important points are already covered above. – Wile E. Mar 24 '14 at 12:26
  • Don't read the file into memory. It isn't scalable and has no performance gains. http://stackoverflow.com/questions/25189262/why-is-slurping-a-file-bad – the Tin Man Jul 31 '15 at 17:07