File::Slurp into a multi-GB scalar - how to split efficiently?

Question

I have a multi-GB file to process in Perl. Reading the file line-by-line takes several minutes; reading it into a scalar via File::Slurp takes a couple of seconds. Good. Now, what is the most efficient way to process each "line" of the scalar? I imagine that I should avoid modifying the scalar, e.g. lopping off each successive line as I process it, to avoid reallocating the scalar.

I tried this:

use File::Slurp;
my $file_ref = read_file( '/tmp/tom_timings/tom_timings_15998', scalar_ref => 1  ) ;

for my $line (split /\n/, $$file_ref) {
    # process line
}

And it's sub-minute: adequate but not great. Is there a faster way to do this? (I have more memory than God.)

`read_file` also allows you to read to an array: `my @lines = read_file( 'filename' );` Of course, you'll still have to loop through the entire array to process each line, so it doesn't change things much. — ThisSuitIsBlackNot, Feb 12 '14 at 18:02
The reason it's slow is it needs to go through the file looking for newlines. If they're fixed width lines you can seek by bytes through the file, which should be faster. If they're variable length lines there's no real way around it. — Oesor, Feb 12 '14 at 18:09
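The fixed-width case mentioned in the comment above can be sketched as follows. This is a minimal illustration, not something from the original post: the helper name `fixed_width_lines` and the record width are hypothetical, and it assumes every record is exactly `$width` bytes followed by a single LF. Because each line's offset is computed arithmetically, the buffer is never scanned for newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes every record is exactly $width bytes
# followed by one "\n" (the last record may omit the "\n").
sub fixed_width_lines {
    my ($buf_ref, $width) = @_;
    my $stride = $width + 1;                 # record plus its trailing LF
    my $len    = length $$buf_ref;
    my @lines;
    for (my $pos = 0; $pos < $len; $pos += $stride) {
        # Offsets are computed, so no search for "\n" is needed.
        push @lines, substr($$buf_ref, $pos, $width);
    }
    return \@lines;
}

my $buf = "aaaa\nbbbb\ncccc\n";
print "<$_>\n" for @{ fixed_width_lines(\$buf, 4) };
```

If the lines vary in length, this approach does not apply and the buffer has to be searched for newlines.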

ikegami · Answer 1 · 2014-02-12T18:57:59.420

split should be very fast unless you start swapping. The only way I can see to speed it up is to write an XS function that looks for LF rather than use a regex.

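As an aside, you could save a lot of memory by switching to the following loop, which extracts one line at a time instead of building the full list of lines up front (note that, unlike split, the captured line keeps its trailing newline):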

while ($$file_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
    my $line = $1;
    # process line
}
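For a quick check of how that loop behaves on edge cases, here is a small self-contained sketch run against an in-memory string rather than a slurped file. The wrapper `each_line` and its callback argument are hypothetical names used only for illustration. It shows that empty lines are preserved and that a final line without a trailing LF is still returned.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the \G loop; $on_line is an illustrative callback.
sub each_line {
    my ($buf_ref, $on_line) = @_;
    pos($$buf_ref) = 0;                      # start scanning from the beginning
    while ($$buf_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
        my $line = $1;                       # copy before calling other code
        chomp $line;                         # the capture keeps the LF
        $on_line->($line);
    }
    return;
}

my $buf = "foo\nbar\n\nbaz";                 # one empty line, no trailing LF
my @seen;
each_line(\$buf, sub { push @seen, $_[0] });
print join('|', map { "<$_>" } @seen), "\n"; # prints <foo>|<bar>|<>|<baz>
```

Only one line is held at a time, whereas `for my $line (split /\n/, $$file_ref)` builds the complete list of lines before the loop body runs.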

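Here is said XS function. It returns the next line without its trailing LF (i.e. chomped) and removes that line from the front of the buffer. Move the newSVpvn_flags line after the if statement if you don't want to chomp.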

SV* next_line(SV* buf_sv) {
    STRLEN buf_len;
    const char* buf = SvPV_force(buf_sv, buf_len);
    const char* next_line_ptr;  /* const to match buf; the buffer is only read */
    const char* buf_end;
    SV* rv;

    if (!buf_len)
        return &PL_sv_undef;

    next_line_ptr = buf;
    buf_end = buf + buf_len;
    while (next_line_ptr != buf_end && *next_line_ptr != '\n')
        ++next_line_ptr;

    rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);

    if (next_line_ptr != buf_end)
        ++next_line_ptr;

    sv_chop(buf_sv, next_line_ptr);
    return rv;  /* Typemap will mortalize */
}

Means of testing it:

use strict;
use warnings;

use Inline C => <<'__EOC__';

SV* next_line(SV* buf_sv) {
    ...
}

__EOC__

my $s = <<'__EOI__';
foo
bar
baz
__EOI__

while (defined($_ = next_line($s))) {
   print "<$_>\n";
}

very useful, thank you (and, five years later on v5.29.2 it still works like a charm :) — zdim, Dec 10 '19 at 05:23
@zdim, ok, but 5.29 is a dev version, and 5.30 was released "long ago"! — ikegami, Dec 10 '19 at 05:45
Good point -- that's just the most recent one that I have fully set up. Now tried with 5.30.0 (the latest perlbrew I got) and it works as well — zdim, Dec 10 '19 at 06:05
