I have a very large (1000 GB) unsorted file of ID pairs:
- ID:ABC123 ID:ABC124
- ID:ABC123 ID:ABC124
- ID:ABC123 ID:ABA122
- ID:ABC124 ID:ABC123
- ID:ABC124 ID:ABC126
I would like to filter the file for
1) duplicates
example
ABC123 ABC124
ABC123 ABC124
2) reverse pairs (discard the second occurrence)
example
ABC123 ABC124
ABC124 ABC123
After filtering, the example file above would look like
- ID:ABC123 ID:ABC124
- ID:ABC123 ID:ABA122
- ID:ABC124 ID:ABC126
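One observation worth stating up front (the snippet below is only a sketch of the idea, not my current code): if every line is rewritten so the two IDs appear in a fixed, canonical order, a reverse pair becomes an ordinary duplicate, so a single duplicate-removal step handles both rules at once.

# Sketch: put each pair into a canonical order (lexically smaller ID first).
# After this, "ID:ABC124 ID:ABC123" and "ID:ABC123 ID:ABC124" become the
# same line, so removing duplicates also removes reverse pairs.
while (my $line = <FH>) {
    chomp $line;
    my ($id1, $id2) = split / /, $line;
    ($id1, $id2) = ($id2, $id1) if $id2 lt $id1;  # swap into canonical order
    print "$id1 $id2\n";
}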
Currently, my solution is this:
my %hash;
while (my $line = <FH>) {
    chomp $line;                       # remove the trailing newline
    my ($id1, $id2) = split / /, $line;
    # skip the line if either ordering of this pair has already been seen
    if (exists $hash{"$id1 $id2"} || exists $hash{"$id2 $id1"}) {
        next;
    }
    else {
        $hash{"$id1 $id2"} = undef;    # remember the pair
        print "$line\n";
    }
}
This gives me the desired results for smaller lists, but it takes up too much memory for larger lists, since the whole hash is kept in memory.
I am looking for a solution that uses less memory. Some thoughts I have are:
1) save the hash to a file instead of in memory (see the first sketch after this list)
2) multiple passes over the file
3) sorting and uniquing the file with Unix sort -u -k1,2 (see the second sketch below)
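For thought 1, a minimal sketch, assuming the DB_File module (a Berkeley DB binding) is installed and using a placeholder database name seen_pairs.db: tying the hash to an on-disk DBM file keeps the keys on disk, so memory use stays roughly constant at the cost of a disk lookup per line.

use DB_File;

my %hash;
# Tie the hash to an on-disk Berkeley DB file (placeholder name), so that
# key lookups and stores go to disk instead of RAM.
tie %hash, 'DB_File', 'seen_pairs.db' or die "Cannot tie hash: $!";
while (my $line = <FH>) {
    chomp $line;
    my ($id1, $id2) = split / /, $line;
    next if exists $hash{"$id1 $id2"} || exists $hash{"$id2 $id1"};
    $hash{"$id1 $id2"} = 1;
    print "$line\n";
}
untie %hash;

The trade-off is that every line now costs one or two disk lookups, which may be very slow on a 1000 GB input, but the memory footprint no longer grows with the number of distinct pairs.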
After posting on Stack Exchange CS, an external sort algorithm was suggested.
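Putting thought 3 and that suggestion together, a sketch (assuming GNU sort is available; it already performs an external merge sort, spilling to temporary files when the input exceeds memory, and the file names below are placeholders): normalize each pair into canonical order as above, then pipe the result through sort -u, so duplicates, including former reverse pairs, are removed without Perl holding anything large in memory.

# Sketch: canonicalize pairs, then let an external sort remove duplicates.
# File names are placeholders; sort -u does the deduplication on disk.
open(my $in,  '<',  'pairs.txt')              or die "Cannot open input: $!";
open(my $out, '|-', 'sort -u > deduped.txt')  or die "Cannot start sort: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($id1, $id2) = split / /, $line;
    ($id1, $id2) = ($id2, $id1) if $id2 lt $id1;  # canonical order
    print {$out} "$id1 $id2\n";
}
close $in;
close $out or die "sort failed: $!";

One caveat: the output lines are in canonical order, so a pair that originally appeared as ID:ABC124 ID:ABC123 comes out with the IDs swapped. If the original orientation matters, the untouched line can be appended as an extra field before sorting and stripped off afterwards.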