I'm adapting an existing perl script proposed here: Fast alternative to grep -f
I need to filter many very large files (Map file), each ~10 million lines long x 5 fields wide using an also long list (Filter file) and print lines in the map file that match. I tried using grep -f, but it was simply taking too long. I read that this approach will be quicker.
This is what my files look like:
Filter file:
DB775P1:276:C2R0WACXX:2:1101:10000:77052
DB775P1:276:C2R0WACXX:2:1101:10003:51920
DB775P1:276:C2R0WACXX:2:1101:10004:36433
DB775P1:276:C2R0WACXX:2:1101:10004:57256
Map file:
DB775P1:276:C2R0WACXX:2:1101:10000:70401 chr5 21985760 21985780 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr18 14723904 14723924 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr18 14745586 14745606 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr4 7944241 7944261 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr4 8402856 8402876 +
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr8 10864708 10864728 +
DB775P1:276:C2R0WACXX:2:1101:10002:88487 chr17 5681227 5681249 -
DB775P1:276:C2R0WACXX:2:1101:10004:74842 chr13 2569168 2569185 +
DB775P1:276:C2R0WACXX:2:1101:10004:74842 chr14 13253418 13253435 -
DB775P1:276:C2R0WACXX:2:1101:10004:74842 chr14 13266344 13266361 -
I expect the output lines to look like this, because they contains the string present in both the map and filter files.
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr18 14723904 14723924 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr18 14745586 14745606 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr4 7944241 7944261 -
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr4 8402856 8402876 +
DB775P1:276:C2R0WACXX:2:1101:10000:77052 chr8 10864708 10864728 +
Here is the script as I've edited it so far, but no luck:
#!/usr/bin/env perl
use strict;
use warnings;
# Load the files
my $filter = $ARGV[0];
my $sam = $ARGV[1];
open FILE1, $filter;
if (! open FILE1, $filter) {die "Can't open filterfile: $!";}
open FILE2, $sam;
if (! open FILE2, $sam) {die "Can't open samfile: $!";}
# build hash of keys using lines from the filter file
my $lines;
my %keys
while (<FILE1>) {
chomp $lines;
%keys = $lines;
}
close FILE1;
# look up keys in the map file, if match, print line in the map file.
my $samlines;
while (<FILE2>) {
chomp $samlines;
my ($id, $chr, $start, $stop, $strand) = split /\t/, $samline;
if (defined $lines->{$id}) { print "$samline \n"; }
}