
I'm adapting an existing Perl script proposed here: Fast alternative to grep -f

I need to filter many very large files (map files), each ~10 million lines long by 5 fields wide, against a long list (the filter file), and print the lines in the map file that match. I tried using grep -f, but it was simply taking too long. I read that this approach would be quicker.

This is what my files look like:

Filter file:

DB775P1:276:C2R0WACXX:2:1101:10000:77052
DB775P1:276:C2R0WACXX:2:1101:10003:51920
DB775P1:276:C2R0WACXX:2:1101:10004:36433
DB775P1:276:C2R0WACXX:2:1101:10004:57256

Map file:

DB775P1:276:C2R0WACXX:2:1101:10000:70401     chr5    21985760    21985780    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     + 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    + 
DB775P1:276:C2R0WACXX:2:1101:10002:88487     chr17   5681227     5681249     - 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr13   2569168     2569185     + 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr14   13253418    13253435    - 
DB775P1:276:C2R0WACXX:2:1101:10004:74842     chr14   13266344    13266361    -

I expect the output lines to look like this, because they contain a string that is present in both the map and filter files.

DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     - 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     + 
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    + 

Here is the script as I've edited it so far, but no luck:

#!/usr/bin/env perl
use strict;
use warnings;

# Load the files
my $filter = $ARGV[0];
my $sam = $ARGV[1];
open FILE1, $filter;
   if (! open FILE1, $filter) {die "Can't open filterfile: $!";}
open FILE2, $sam;
   if (! open FILE2, $sam) {die "Can't open samfile: $!";}

# build hash of keys using lines from the filter file
my $lines;
my %keys
while (<FILE1>) {
   chomp $lines;
   %keys = $lines;
}
close FILE1;

# look up keys in the map file, if match, print line in the map file.
my $samlines;
while (<FILE2>) {
   chomp $samlines;
   my ($id, $chr, $start, $stop, $strand)  = split /\t/, $samline;
   if (defined $lines->{$id}) { print "$samline \n"; }
}
  • [`grep` is absurdly fast](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html). Maybe you could write something faster, but ultimately you're still reading stuff linearly from a disk. I would instead consider putting the data in a database. – Schwern Nov 26 '15 at 05:07
  • Yes, and typically I use grep -f for similar tasks. In fact, I've been running it while trying to troubleshoot something that will run faster. However, 24 hours later, grep -f still hasn't completed one of the jobs at hand. Trying to make a respectable attempt to come up with something speedier. – RedPandaSpaceOdyssey Nov 26 '15 at 20:57
  • It would probably be speedier to put it in a database and do the queries there. (Also, great name) I would suggest trying that approach in parallel with your own. – Schwern Nov 26 '15 at 21:36
  • The limiting factor on file IO is almost always the file IO. Doesn't really matter what tool you use - disks only spin so fast. Optimising might be doable by e.g. preloading files into memory/database/faster disk. – Sobrique Nov 27 '15 at 10:01

2 Answers


You don't seem to have made a real attempt at solving this problem yourself. The code you show won't even compile.

There are a few reasons why it's not working:

  • You are using file read loops with implicit control variables, which read each line into $_, but you are somehow expecting data to appear in $lines and $samlines. You are also using $samline, which you don't even declare.

  • The line

    my %keys
    

    needs a semicolon at the end

  • I don't know what you expect to be in $lines, but assigning a scalar value to a hash like this

    %keys = $lines;
    

    will produce the warning "Odd number of elements in hash assignment" and leave you with a hash containing only a single element. (A minimal corrected sketch follows this list.)
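
For illustration, here is a minimal sketch of the per-line assignment the original code seems to be aiming for (this is my reading of the intent; it uses the implicit $_ loop variable and 1 as a dummy value):

while (<FILE1>) {
    chomp;
    $keys{$_} = 1;    # one hash key per filter line
}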

Here is a Perl program that does what I believe you intended, but I can't say whether it will be significantly faster than command-line grep. Note that I have used the autodie pragma instead of explicitly testing the status of each file IO operation.

#!/usr/bin/env perl

use strict;
use warnings;
use v5.10.1;
use autodie;

my ($filter_f, $sam_f) = @ARGV;

my %filter;

# Build a lookup hash keyed on the first whitespace-delimited
# field of each line in the filter file
{
    open my $fh, '<', $filter_f;

    while ( <$fh> ) {
        $filter{$1} = 1 if /(\S+)/;
    }
}

# Print every map file line whose first field appears in the hash
open my $fh, '<', $sam_f;

while ( <$fh> ) {
    print if /(\S+)/ and $filter{$1};
}
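
For reference, a hypothetical invocation would look like this (filter_map.pl is just a placeholder name; redirect stdout to a file for the real runs):

perl filter_map.pl filter.file map.file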

output

DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14723904    14723924    -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr18   14745586    14745606    -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    7944241     7944261     -
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr4    8402856     8402876     +
DB775P1:276:C2R0WACXX:2:1101:10000:77052     chr8    10864708    10864728    +
Borodin
  • Thanks a bunch for this! In my defense, it's the first time I've ever tried to write anything in Perl. Ever. – RedPandaSpaceOdyssey Nov 26 '15 at 21:00
  • The code works swell on the test data, but when I run it on the full files, I get: `perl: warning: Falling back to the standard locale ("C"). Out of memory! Out of memory! perl: warning: Setting locale failed. perl: warning: Please check that your locale settings: LANGUAGE = (unset), LC_ALL = (unset), LANG = "en_US.UTF-8" are supported and installed on your system.` – RedPandaSpaceOdyssey Nov 26 '15 at 21:14
  • @RedPandaSpaceOdyssey How big is that filter file? – Schwern Nov 26 '15 at 21:37
  • There are 48 map files with one filter file for each. The filter files range from 6.9M to 31M, and the map files range from 400M to 3.5G. – RedPandaSpaceOdyssey Nov 29 '15 at 05:09
  • Then I don't see how your program can be producing an *Out of memory!* error. The filter hash should take up no more than about 300MB, and the map file is being read one line at a time. Have you written something different from my solution? – Borodin Nov 30 '15 at 23:35

So, Borodin's proposed script does in fact work. However, I found that my files were too large for it to complete. Instead, I sorted both files using 'sort' and then followed that up with join.

join -1 1 -2 1 filter.file map.file > filtered.map
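
Roughly, the full pipeline looks like this, with the sort step spelled out (the intermediate file names here are just placeholders; join needs both inputs sorted on the join field):

sort -k1,1 filter.file > filter.sorted
sort -k1,1 map.file > map.sorted
join -1 1 -2 1 filter.sorted map.sorted > filtered.map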

For each of the 48 jobs, I reserved 16G of RAM and 8 processors.

Thank you everyone for your help with this!