
I have a rudimentary script in Perl6 which runs very slowly, about 30x slower than the exact perl5 translation.

CONTROL {
    when CX::Warn {
        note $_;
        exit 1;
    }
}
use fatal;
role KeyRequired {
    method AT-KEY (\key) {
        die "Key {key} not found" unless self.EXISTS-KEY(key);
        nextsame;
    }
}

for dir(test => /^nucleotide_\d**2_\d**2..3\.tsv$/) -> $tsv {
    say $tsv;
    my $qqman = $tsv.subst(/\.tsv$/, '.qqman.tsv');
    my $out = open $qqman, :w;
    put "\t$qqman";
    my UInt $line-no = 0;
    for $tsv.lines -> $line {
        if $line-no == 0 {
            $line-no = 1;
            $out.put(['SNP', 'CHR', 'BP', 'P', 'zscore'].join("\t"));
            next
        }
        if $line ~~ /.+X/ {
            next
        }
        $line-no++;
        my @line = $line.split(/\s+/);
        my $chr = @line[0];
        my $nuc = @line[1];
        my $p = @line[3];
        my $zscore = @line[2];
        my $snp = "'rs$line-no'";
        $out.put([$snp, $chr, $nuc, $p, $zscore].join("\t"));
        #$out.put();
    }
    last
}

Reading a file line by line like this is idiomatic with Perl5's `while`.

This is a very simple script, which only alters columns of text in a file. This Perl6 script runs in 30 minutes. The Perl5 translation runs in 1 minute.

I've tried reading "Using Perl6 to process a large text file, and it's Too Slow. (2014-09)" and "Perl6 : What is the best way for dealing with very big files?", but I'm not seeing anything that could help me here :(

I'm running Rakudo version 2018.03 built on MoarVM version 2018.03 implementing Perl 6.c.

I realize that Rakudo hasn't matured to Perl5's level (yet, I hope), but how can I get this to read the file line by line in a more reasonable time frame?

jjmerelo
con
  • What makes you think reading a file line-by-line is the bottleneck for your script? – ugexe Apr 12 '19 at 22:07
  • @ugexe the math is otherwise very simple. I didn't expect it to take a long time otherwise. However, I will trim the script down to verify that line-by-line is in fact the bottleneck. – con Apr 12 '19 at 22:11
  • There is much more than simple math going on. There is IO (opening/reading file), assignment, regex parsing, and type constraint checks. – ugexe Apr 12 '19 at 22:13
  • @ugexe it's definitely the regex that's slowing it down. Is there a more idiomatic way of writing `if $line ~~ m:P5/.+X/ {` in Perl6? I thought that would be virtually instant – con Apr 12 '19 at 23:45
  • Seems like you could change `$line ~~ /.+X/` to `$line.index("X")` since you are not capturing anything. – ugexe Apr 13 '19 at 00:29
  • If you are working on multiple files, you could try to work on them in parallel by changing from `for` to `map` and then using [hyper or race](https://6guts.wordpress.com/2017/03/16/considering-hyperrace-semantics/). – LuVa Apr 13 '19 at 06:55
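
A rough sketch of the `race` suggestion from the last comment, for the case where several files are processed rather than only the first. The `count-lines` sub is a hypothetical stand-in for the real per-file work, and the `dir` pattern is simplified; neither comes from the question.

```raku
# Hypothetical stand-in for the per-file conversion; it only counts lines.
sub count-lines(IO::Path $tsv) {
    $tsv.basename => $tsv.lines.elems
}

# .race lets several files be processed at once, in no guaranteed order;
# :batch(1) hands each worker a single file at a time.
my @results = dir(test => /\.tsv$/).race(:batch(1)).map(&count-lines);
.say for @results;
```

Note that the loop in the question ends with `last`, so it only ever handles one file; this would only pay off once that is removed, and it does not make any single file faster.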

1 Answer


There are a number of things I would change.

  • `/.+X/` can be simplified to just `/.X/`, or even `$line.substr(1).contains('X')`
  • `$line.split(/\s+/)` can be simplified to `$line.words`
  • `$tsv.subst(/\.tsv$/, '.qqman.tsv')` can be simplified to `$tsv.substr(0, *-4) ~ '.qqman.tsv'`
  • `uint` instead of `UInt`
  • `given .head {}` instead of `for … {last}`

given dir(test => /^nucleotide_\d**2_\d**2..3\.tsv$/).head -> $tsv {
    say $tsv;
    my $qqman = $tsv.substr(0, *-4) ~ '.qqman.tsv';
    my $out = open $qqman, :w;
    put "\t$qqman";

    my uint $line-no = 0;
    for $tsv.lines -> $line {
        FIRST {
            $line-no = 1;
            $out.put(('SNP', 'CHR', 'BP', 'P', 'zscore').join("\t"));
            next
        }
        next if $line.substr(1).contains('X');

        ++$line-no;

        my ($chr,$nuc,$zscore,$p) = $line.words;

        my $snp = "'rs$line-no'";
        $out.put(($snp, $chr, $nuc, $p, $zscore).join("\t"));
        #$out.put();
    }
}
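
As a quick sanity check of the replacements listed above, here is a small sketch that compares each original expression with its simpler form. The sample lines and the file name are made up for illustration and are not from the question.

```raku
# Made-up sample lines (tab-separated); only their shape matters.
my @samples = "1\t100\t0.5\t0.01", "X\t100\t0.5\t0.01",
              "1\tX00\t0.5\t0.01", "XX\t1\t2\t3";

for @samples -> $line {
    # All three ask the same question: is there an X after the first character?
    my $plus   = ($line ~~ /.+X/).so;
    my $single = ($line ~~ /.X/).so;
    my $plain  = $line.substr(1).contains('X');
    die "X tests disagree on: $line" unless $plus == $single == $plain;

    # With no leading or trailing whitespace, split and words give the same fields.
    die "field lists differ on: $line"
        unless $line.split(/\s+/).join('|') eq $line.words.join('|');
}

# Swapping the extension: substr(0, *-4) keeps everything except the last four characters.
my $name = 'nucleotide_01_02.tsv';    # hypothetical file name
say $name.substr(0, *-4) ~ '.qqman.tsv';
say $name.subst(/\.tsv$/, '.qqman.tsv');
```

If the script runs to the end without dying, the forms agree on these samples, and the two `say` lines should show the same output name. Lines with leading whitespace or an empty line would need separate handling.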
Brad Gilbert
  • `if $line ~~ /.X/ {` is indeed the bottleneck. Amazing, removing that one `+` shortens from 30 minutes to 9 minutes. `$line.substr(1).contains('X')` isn't quite the same, because the X could be at the front of the line, I think? X at the front of the line is acceptable. – con Apr 13 '19 at 15:40
  • @con: `/.+X/`, `/.X/`, and `$line.substr(1).contains('X')` are functionally identical. `/.*X/`, `/X/`, and `$line.contains('X')` are functionally identical. (The `.substr(1)` is so that it doesn't match `X` at the beginning of the line.) – Brad Gilbert Apr 13 '19 at 19:12
  • Although the OP is mainly looking at speed, they briefly mention idiomaticness as well, so for processing line by line on a file I'm opening up, I've been finding the format `for $tsv.IO.lines -> $line` to be a much nicer way. Ditto for `$out.put: .join("\t")`, despite being functionally the same. – user0721090601 Apr 13 '19 at 22:12
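
Putting the suggestions from the answer and the comments together, one possible shape for the per-file loop is sketched below. The sub name `to-qqman` and its two-argument signature are assumptions made for illustration; the sketch also assumes the input has one header line and no empty lines.

```raku
# Hypothetical helper: converts one TSV file into the qqman layout.
sub to-qqman(IO() $tsv, IO() $qqman) {
    my $out = $qqman.open(:w);
    $out.put: <SNP CHR BP P zscore>.join("\t");

    my uint $line-no = 1;
    # skip(1) drops the input header, so no per-line counter test is needed.
    for $tsv.lines.skip(1) -> $line {
        # Ignore lines with an X anywhere after the first character.
        next if $line.substr(1).contains('X');

        my ($chr, $nuc, $zscore, $p) = $line.words;
        $out.put: ("'rs{++$line-no}'", $chr, $nuc, $p, $zscore).join("\t");
    }
    $out.close;
}
```

It could be called as `to-qqman('in.tsv', 'in.qqman.tsv')`, where both file names are placeholders. The numbering matches the original script: the first retained data line becomes `'rs2'`, and skipped lines do not advance the counter.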