I have a multi-GB file to process in Perl. Reading the file line-by-line takes several minutes; reading it into a scalar via File::Slurp takes a couple of seconds. Good. Now, what is the most efficient way to process each "line" of the scalar? I imagine that I should avoid modifying the scalar, e.g. lopping off each successive line as I process it, to avoid reallocating the scalar.

I tried this:

use File::Slurp;
my $file_ref = read_file( '/tmp/tom_timings/tom_timings_15998', scalar_ref => 1  ) ;

for my $line (split /\n/, $$file_ref) {
    # process line
}

And it's sub-minute: adequate but not great. Is there a faster way to do this? (I have more memory than God.)

Chap
  • `read_file` also allows you to read to an array: `my @lines = read_file( 'filename' );` Of course, you'll still have to loop through the entire array to process each line, so it doesn't change things much. – ThisSuitIsBlackNot Feb 12 '14 at 18:02
  • @ThisSuitIsBlackNot - I tried that; takes a long time. – Chap Feb 12 '14 at 18:06
  • The reason it's slow is that it needs to go through the file looking for newlines. If the lines are fixed width, you can seek by bytes through the file, which should be faster. If they're variable length, there's no real way around it. – Oesor Feb 12 '14 at 18:09
  • +1: "*(I have more memory than God.)*" ;) – DavidO Feb 12 '14 at 19:07

1 Answer

`split` should be very fast unless you start swapping. The only way I can see to speed it up is to write an XS function that scans for LF directly rather than using a regex.

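As an aside, `for (split ...)` builds the complete list of lines before the loop starts. You could save a lot of memory by switching to a loop that extracts one line at a time and leaves the scalar unmodified:

while ($$file_ref =~ /\G([^\n]*\n|[^\n]+)/g) {
    my $line = $1;  # still has its trailing newline, if there was one
    # process line
}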

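Here is the XS function mentioned above. Each call removes the first line from the buffer and returns it chomped. If you don't want to chomp, move the `newSVpvn_flags` line after the `if (next_line_ptr != buf_end)` statement.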

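SV* next_line(SV* buf_sv) {
    STRLEN buf_len;
    const char* buf = SvPV_force(buf_sv, buf_len);
    const char* next_line_ptr;  /* const, to match buf */
    const char* buf_end;
    SV* rv;

    /* Empty buffer: no more lines. */
    if (!buf_len)
        return &PL_sv_undef;

    /* Scan forward to the first LF or the end of the buffer. */
    next_line_ptr = buf;
    buf_end = buf + buf_len;
    while (next_line_ptr != buf_end && *next_line_ptr != '\n')
        ++next_line_ptr;

    /* Copy the line without its LF, preserving the UTF-8 flag. */
    rv = newSVpvn_flags(buf, next_line_ptr-buf, SvUTF8(buf_sv) ? SVf_UTF8 : 0);

    /* Step over the LF, if there was one. */
    if (next_line_ptr != buf_end)
        ++next_line_ptr;

    /* Drop the consumed bytes from the front of the buffer. */
    sv_chop(buf_sv, next_line_ptr);
    return rv;  /* Typemap will mortalize */
}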
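
The byte-by-byte scan in `next_line` could probably also be written with `memchr`, which C libraries often optimize heavily. The following is a minimal standalone sketch of that idea, not a drop-in replacement and not benchmarked. `line_end` and `count_lines` are hypothetical helper names, and the code works on plain C buffers rather than Perl SVs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the XS code above): returns a pointer
 * to the first '\n' in [buf, buf + len), or buf + len if there is none. */
static const char *line_end(const char *buf, size_t len) {
    const char *nl = memchr(buf, '\n', len);
    return nl ? nl : buf + len;
}

/* Hypothetical helper: walks the buffer using the same line boundaries as
 * next_line() and counts the lines. A trailing fragment without '\n'
 * still counts as a line; an empty buffer has zero lines. */
static size_t count_lines(const char *buf, size_t len) {
    const char *p = buf;
    const char *end = buf + len;
    size_t n = 0;

    while (p != end) {
        const char *eol = line_end(p, (size_t)(end - p));
        ++n;
        /* Step over the LF if one was found, otherwise stop at the end. */
        p = (eol != end) ? eol + 1 : end;
    }
    return n;
}
```

For example, `count_lines("foo\nbar\nbaz\n", 12)` returns 3, and `count_lines("foo\nbar", 7)` returns 2 because the unterminated final fragment is treated as a line. In the XS function, the equivalent change would be to replace the `while` loop with a single `memchr` call over `buf` and `buf_len`.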

A way to test it, using Inline::C:

use strict;
use warnings;

use Inline C => <<'__EOC__';

SV* next_line(SV* buf_sv) {
    ...
}

__EOC__

my $s = <<'__EOI__';
foo
bar
baz
__EOI__

while (defined($_ = next_line($s))) {
   print "<$_>\n";
}
ikegami