3

I have a string that is ~10 GB in size (huge RAM usage, of course). I need to perform string operations like gsub and split on it, but I noticed that Ruby just "stops working" at some point (without raising any errors, though).

Example:

str = HUGE_STRING_10_GB

# I will try to split the string using .split:
str.split("\r\n")
# but Ruby will instead just return an array containing
# the full, unsplit string itself...

# let's break this down:
# each of those attempts doesn't cause problems and 
# returns arrays with thousands or even millions of items (lines)
str[0..999].split("\r\n")
str[0..999_999].split("\r\n")
str[0..999_999_999].split("\r\n")

# starting from here, problems will occur
str[0..1_999_999_999].split("\r\n")

I'm using Ruby MRI 1.8.7. What is wrong here? Why is Ruby not able to perform string operations on huge strings, and what would be a solution?

The only solution I came up with is to "loop" through the string in slices like [0..9], [10..19], ... and to perform the string operations part by part. However, this seems unreliable: what if, for example, the split delimiter is long and falls across the boundary between two "parts"?

Another solution that actually works fine is to iterate over the string with str.each_line {..}. However, that only covers newline delimiters.

EDIT: Thanks for all those answers. In my case, the "HUGE 10 GB STRING" is actually a download from the internet. It contains data that is delimited by a specific sequence (in most cases a simple newline). In my scenario I compare EACH ELEMENT of the 10 GB file to another (smaller) data-set that I already have in my script. I appreciate all suggestions.
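
To make that concrete, here is roughly the shape of the comparison (the sample set and all names here are just placeholders, not my real data):

require 'set'

# Placeholder for the (much smaller) data-set I already have in my script.
known_items = Set.new(%w[foo bar baz])

matches = []
str.each_line("\r\n") do |element|   # one "\r\n"-delimited element at a time
  element = element.chomp("\r\n")
  matches << element if known_items.include?(element)
end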

Benedikt B
  • 733
  • 8
  • 23
  • 5
    I would highly suggest _not_ using a dynamic language like Ruby for this. Also, how much RAM do you have? – Dogbert May 08 '13 at 11:36
  • I'd also highly recommend not doing these operations on the full string all at once. – mcfinnigan May 08 '13 at 11:37
  • And what kind of gsub and split (what regex/strings exactly) do you want to perform on this? – Dogbert May 08 '13 at 11:40
  • 2
    Regexes on a 10G string? That won't end well. What's the purpose? What's the goal? Would you be better off parsing the string? Without any details it's impossible to provide reasonable advice other than it's very unlikely that operations on an entire 10G string will blow up the world. – Dave Newton May 08 '13 at 12:56
  • +1 @DaveNewton, but it might burn off the atmosphere. – the Tin Man May 08 '13 at 14:02
  • @theTinMan I meant likely :( – Dave Newton May 08 '13 at 14:06
  • Either way it's as bad as crossing the streams, which, if I remember correctly, was "bad". – the Tin Man May 08 '13 at 15:21
  • We don't really know if it's a problem for Ruby. The OP is not processing a big file correctly in the first place, putting Ruby, Perl, Python, Java, or almost any other language that allows slurping files in a bad situation. I'd use C or C++ for this task if I had to read that big of a file at once on one of my hosts, but it still makes no sense to slurp when line-by-line I/O is more scalable. – the Tin Man May 08 '13 at 17:23

6 Answers

8

Here's a benchmark against a real-life log file. Of the methods used to read the file, only the one using foreach is scalable because it avoids slurping the file.

Using lazy adds overhead, resulting in slower times than map alone.

Notice that foreach is right in there as far as processing speed goes, and it results in a scalable solution. Ruby won't care whether the file is a zillion lines or a zillion TB; it still sees only a single line at a time. See "Why is "slurping" a file not a good practice?" for some related information about reading files.

People often gravitate toward something that pulls in an entire file at once and then splits it into parts. That ignores the job Ruby then has to do to rebuild the array based on line ends, using split or something similar. That work adds up, and is why I think foreach pulls ahead.

Also notice that the results shift a little between the two benchmark runs. This is probably due to system tasks running on my Mac Pro while the jobs were running. The important thing is that it shows the difference is a wash, confirming to me that using foreach is the right way to process big files: it won't kill the machine if the input file exceeds available memory.

require 'benchmark'

REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1

# readlines: "Reads the entire file specified by name as individual lines, and
# returns those lines in an array."
#
# Because the whole file is read into an array of lines up front, this isn't
# scalable. It will work only if Ruby has enough memory to hold the array plus
# all other variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end

# readlines: "Reads the entire file specified by name as individual lines, and
# returns those lines in an array."
#
# Because the whole file is read into an array of lines up front, this isn't
# scalable. It will work only if Ruby has enough memory to hold the array plus
# all other variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end

# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
# 
# As a result of returning all the lines in an array this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
#
# It's slower, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"

2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map')  { lazy_map(LOG)  }
    b.report('map')       { map(LOG)       }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach')   { foreach(LOG)   }
  end
end

%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end

Which results in:

Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
                    user     system      total        real
lazy_map       35.010000   4.120000  39.130000 ( 43.688429)
map            29.510000   7.440000  36.950000 ( 43.544893)
readlines      28.750000   9.860000  38.610000 ( 43.578684)
foreach        25.380000   4.120000  29.500000 ( 35.414149)
                    user     system      total        real
lazy_map       32.350000   9.000000  41.350000 ( 51.567903)
map            24.740000   3.410000  28.150000 ( 32.540841)
readlines      24.490000   7.330000  31.820000 ( 37.873325)
foreach        26.460000   2.540000  29.000000 ( 33.599926)
5540058 83892946 733978797 lazy_map.out
5540058 83892946 733978797 map.out
5540058 83892946 733978797 readlines.out
5540058 83892946 733978797 foreach.out

The use of gsub is innocuous since every method uses it, but it's not needed and was added for a bit of frivolous resistive loading.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
4

If you want to process a large file line by line, this will be much more resilient and less memory-hungry:

File.open('big_file.log') do |file|
  file.each_line do |line|
    # Process the line
  end
end

This approach would not let you cross-reference lines, but if you need that, consider using a scratch database.
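
For the scratch-database idea, a minimal sketch using the sqlite3 gem could look like this (the file and table names are made up; adapt them to your data):

require 'sqlite3'

db = SQLite3::Database.new('scratch.db')
db.execute('CREATE TABLE IF NOT EXISTS lines (lineno INTEGER, body TEXT)')
db.execute('CREATE INDEX IF NOT EXISTS idx_lineno ON lines (lineno)')

lineno = 0
db.transaction do                        # a single transaction keeps the inserts fast
  File.foreach('big_file.log') do |line|
    lineno += 1
    db.execute('INSERT INTO lines (lineno, body) VALUES (?, ?)', [lineno, line.chomp])
  end
end

# Lines can now be cross-referenced without holding the file in memory:
rows = db.execute('SELECT body FROM lines WHERE lineno IN (1, 100) ORDER BY lineno')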

Neil Slater
  • 26,512
  • 6
  • 76
  • 94
2

I ran into this problem before. Unfortunately, Ruby doesn't have an equivalent of Perl's Tie::File, which maps the lines of a file to an array backed by the file on disk. If you have Perl on the machine and don't mind being disloyal to Ruby just once, give the following piece of code a shot:

use strict;
use Tie::File;

my $filename = shift;

tie my @lines, 'Tie::File', $filename 
    or die "Coud not open $filename\n";

for (@lines) {              # process all the lines as you see fit
    s/RUBY/ruby/g;         
    }

# you can cross reference lines if necessary

$lines[0] = $lines[99] . "!";   # replace the content of the first line with that of the 100th, plus "!"

untie @lines;

You can process files (almost) as big as you want.

If you can use Ruby 2.0, a solution would be to build an enumerator (even a lazy one, which reduces memory consumption while processing). For instance, like this (it processes only as much as necessary, and is far faster than the same code without .lazy, so I guess the file is not fully loaded into memory and each line is deallocated as we go):

File.open("dummy.txt") do |f| 
    f.lazy.map do |l|
        l.gsub(/ruby/, "RUBY")
    end.first(10)
end

All of this also depends on how you would process the output.


I did some benchmarking. On Ruby 2.0.0, at least, each_line keeps memory consumption pretty low: under 64 MB while processing a 512 MB file (where each line had the word "RUBY"). Laziness (replacing each_line with lazy.each in the code below) does not provide any improvement in memory usage or in execution time.

File.open("dummy", "w") do |out|
    File.open("DUMMY") do |f| 
        f.each_line do |l|
            out.puts l.gsub(/RUBY/, "ruby")
        end
    end
end
i-blis
  • 3,149
  • 24
  • 31
  • why using `flat_map` in this case? Also, what is the benefit lazy a map if there is no intermediate array? – fotanus May 08 '13 at 13:23
  • @fotanus the lazy map avoids reading the whole file, doesn't it? The `each_line` in between was a typo. I removed it. As such doesn't it lazily reads the file? – i-blis May 08 '13 at 13:39
  • I'm not sure, [reading this](http://railsware.com/blog/2012/03/13/ruby-2-0-enumerablelazy/) we can see that it does not expand into an array. But will it load all file in memory? I'm not sure either, but I'm curious. – fotanus May 08 '13 at 13:47
  • @fotanus on my machine lazily processing the (first 10) entries of 1GB is way faster. So I guess it doesn't load until needed. I'm curious too on that matter. – i-blis May 08 '13 at 13:49
  • I suspect that `map` is the important part and that, in this case, `lazy` isn't really adding anything that useful. `map` is reading the file line by line, not trying to slurp it, which would be a lot faster than trying to read a 10GB file into RAM, with all its attending memory allocations, and probable swapping that would occur. – the Tin Man May 08 '13 at 14:04
  • @the_thin_man when processing but the first lines the lazy variant is way faster, because it reads but what necessary. Doesn't it mean that we spare on memory allocation when processing the whole file? Or is the "normal" Enum smart enough to deallocate when it processes further? – i-blis May 08 '13 at 14:09
  • We'll need to run Benchmarks on that particular use, but I suspect `lazy` isn't really helping nearly as much as using the line I/O that `map` adds. The data is read line by line into a string, so it gets reused during each loop. There's nothing to worry about as far as memory deallocation. Ruby is happy to read terabytes as long as the line read doesn't fill the available memory. – the Tin Man May 08 '13 at 14:12
  • @theTinMan In the case above (i.e. processing but the 5 first entries) with a 5GB file: with lazy: real=0.00111, without: real=9.133331. When processing the whole file though, there is but little difference lazy: 7.572588, no-lazy: 8.488584 (12% gain). – i-blis May 08 '13 at 14:25
  • Great suggestions i-blis and tin man. Do you see any way how I could multi-thread this solution (e.g. forks)? For example, would it be possible to read only specific parts of the file with lazy/map or File.foreach? Then I could launch multiple forks, each reading a specific part of the file and processing it. – Benedikt B May 08 '13 at 15:08
  • 1
    Reading through the docs for Enumerable#map, it looks like there's a problem when handling a file of that size still. `map` "Returns a new array with the results of running block once for every element in enum." which means it's all still being pulled into memory. `lazy` delays that, but doesn't remove that "returns a new array" functionality of `map`, it only delays it as long as possible. In the OPs case, the file still needs to be read in its entirety for processing, and pulling it into RAM is what's killing Ruby's processing of the file. – the Tin Man May 08 '13 at 15:16
  • @theTinMan you're right on the lazy side of the problem, it simply delays but apparently doesn't affect memory allocation. Still not sure, doing some benchmarking (the one above was on a 5MB file not 5GB!) – i-blis May 08 '13 at 15:31
  • @BenediktB Did you consider using the Perl script, it is pretty efficient. – i-blis May 08 '13 at 15:32
  • Similar to the answer by @NeilSlater about using `open` and `each_line`, use `foreach` instead. – the Tin Man May 08 '13 at 17:17
  • @theTinMan Right, best practice idiom. But boils down to the same, since we already use a block here. – i-blis May 08 '13 at 17:23
  • Correct. It's just using the "more correct" method vs. the more verbose. – the Tin Man May 08 '13 at 17:35
1

Do you even have 10+ GB of RAM to fit the string in memory?

I assume the string is loaded from a file, so consider processing the file directly using each_line or something along those lines...

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
1

I noticed that Ruby will just "stop working" at some point (...) I'm using Ruby MRI 1.8.7, what is wrong here?

Unless you have a lot of RAM, this is because you are experiencing thrashing at the application level: the process can't get much done each time it gains control of the CPU, because it has to swap memory to and from disk all the time.

Why is Ruby not able to perform string operations on huge strings?

I suspect no language is, unless it reads the data in parts from a file.

And what is a solution here?

I couldn't help noticing that you are trying to split your file into strings and afterwards want to match substrings with a regexp, so I can see two alternatives:

  1. (simple): If your regexps only ever match within a single line, you will do better by keeping the text in a file and making a grep system call to retrieve whatever you need - grep was created to deal with huge files, so you don't have to worry about that yourself.

  2. (complex): However, if your regexp is a multiline regexp, you will have to read parts of your file with the read call, specifying how many bytes you want to read at once. Then you will have to keep track of what has been matched and prepend the unmatched tail of each chunk to the next chunk of bytes, because the two together can form a match (a rough sketch of this is shown after this list). At this point, as @Dogbert suggested, you might start to think about switching to a static language, because you will be programming at a low level anyway. Maybe write a Ruby C extension?

If you need more details on your approach, let me know and I can write more about one of the two above.
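
For alternative 1, a simple system call is usually enough (the pattern and file name here are hypothetical):

matching_lines = `grep -F 'some string' huge_download.txt`.split("\n")

For alternative 2, here is a rough sketch of the chunked approach in Ruby. The file name, chunk size and delimiter are assumptions; the important part is that the possibly incomplete tail of each chunk is carried over and prepended to the next chunk, so a delimiter that falls across two chunks is still handled correctly:

CHUNK_SIZE = 16 * 1024 * 1024          # read 16 MB at a time; tune to taste
DELIMITER  = "\r\n"                    # assumed record separator

record_count = 0
carry = ''                             # possibly incomplete trailing record

File.open('huge_download.txt', 'rb') do |f|
  while (chunk = f.read(CHUNK_SIZE))
    buffer  = carry + chunk
    records = buffer.split(DELIMITER, -1)        # -1 keeps a trailing empty field
    carry   = records.pop                        # may be cut off mid-record
    records.each { |record| record_count += 1 }  # process each complete record here
  end
end

record_count += 1 unless carry.empty?  # whatever is left is the final record
puts record_count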

Wayne Conrad
  • 103,207
  • 26
  • 155
  • 191
fotanus
  • 19,618
  • 13
  • 77
  • 111
  • Thanks for the trashing explanation, that makes sense! I updated the OP with an edit. Grep is a good suggestion, but since I need to process the whole file at once I think File.foreach is the best solution so far. I'm now still looking for a multi-threading approach, e.g. to have multiple processes each read a part of this file and process it (I wrote about this in i-blis's answer) – Benedikt B May 08 '13 at 15:15
  • Um, as @WayneConrad corrected, the term is "thrashing", not "trashing". They have different connotations and meaning. – the Tin Man May 08 '13 at 17:11
  • 1
    @BenediktB I suppose you didn't addressed parallelism in the original question, and so it is better create a new question - changing it now will invalidate all answers. – fotanus May 08 '13 at 19:26
  • Understood, I created it here http://stackoverflow.com/questions/16454467/ruby-file-reading-parallelisim – Benedikt B May 09 '13 at 04:40
1

Assuming that the string is read from disk, you could use foreach to read and process one line at a time, writing each one back to disk. Something like:

File.open("processed_file", "w") do |dest|
  File.foreach("big_file", "\r\n") do |line|
    # processing goes here
    dest << line
  end
end
Stefan
  • 109,145
  • 14
  • 143
  • 218
  • 1
    Instead of `new`, use `open`, and use the block forms when you do. Ruby will automatically close the files for you then. And, instead of `open('some/file', 'r') ... each_line` use `File.foreach('some/file')` to iterate over it. – the Tin Man May 08 '13 at 14:07