I have two big tab-delimited files.
first file ->
Col1    Col2  Col3  Col4  Col5  Col6  Col7  Col8
101_#2  1     2     F0    263   278   2     1.5
102_#1  1     6     F1    766   781   1     1.0
103_#1  2     15    V1    526   581   1     0.0
103_#1  2     9     V2    124   134   1     1.3
104_#1  1     12    V3    137   172   1     1.0
105_#1  1     17    F2    766   771   1     1.0
second file ->
Col1    Col2  Col3  Col4
97486   9     262   279
67486   9     118   119
87486   9     183   185
248233  9     124   134
I want to compare Col5 and Col6 of file 1 (treated as a range) against Col3 and Col4 of file 2. If a file-1 row's range falls within any range in file 2, print that row from file 1.
Expected output ->
Col1    Col2  Col3  Col4  Col5  Col6  Col7  Col8
101_#2  1     2     F0    263   278   2     1.5
103_#1  2     9     V2    124   134   1     1.3
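To make the matching rule explicit: a file-1 row should be printed when its [Col5, Col6] interval lies inside some file-2 [Col3, Col4] interval, i.e. Col5 >= Col3 and Col6 <= Col4. A minimal sketch of just that test (in Python for brevity; `contained` is an illustrative name of mine):

```python
# Containment test: range (p1, p2) from file 1 matches a file-2 range (s, e)
# when s <= p1 and p2 <= e.
def contained(p1, p2, ranges):
    return any(s <= p1 and p2 <= e for s, e in ranges)

# The four (Col3, Col4) ranges from file 2 above:
file2_ranges = [(262, 279), (118, 119), (183, 185), (124, 134)]

print(contained(263, 278, file2_ranges))  # row 101_#2    -> True
print(contained(526, 581, file2_ranges))  # row 103_#1 V1 -> False
```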
So far I have tried ->
@ARGV or die "No input file specified";
open my $first,  '<', $ARGV[0] or die "Unable to open input file: $!";
open my $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar <$first>;    # print the header line of file 1
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    while (<$second>) {
        @sec = split /\s+/;
        print join("\t", @cols), "\n" if ($p1 >= $sec[2] && $p2 <= $sec[3]);
    }
}
But this works only for the first row of file 1. Also, the files are very big (around 6 GB).
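I think the reason only the first row works is that the inner `while` exhausts the second filehandle on the first pass of the outer loop, so later rows have nothing left to read. Loading file 2's ranges into memory once and scanning them per file-1 row avoids that. A sketch of that structure (in Python, which I used to prototype; the function names are mine):

```python
import sys

def load_ranges(path):
    """Read file 2 once, keeping only its (Col3, Col4) range pairs."""
    ranges = []
    with open(path) as fh:
        next(fh)                      # skip the header line
        for line in fh:
            cols = line.split()
            ranges.append((int(cols[2]), int(cols[3])))
    return ranges

def filter_file1(path, ranges, out=sys.stdout):
    """Print file-1 rows whose (Col5, Col6) range is contained in a file-2 range."""
    with open(path) as fh:
        print(next(fh), end="", file=out)   # pass the header through
        for line in fh:
            cols = line.split()
            p1, p2 = int(cols[4]), int(cols[5])
            if any(s <= p1 and p2 <= e for s, e in ranges):
                print("\t".join(cols), file=out)

# Usage: filter_file1("file1.txt", load_ranges("file2.txt"))
```

This still scans every file-2 range per row, but it only reads each file once from disk.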
I also tried an approach with hashes:
@ARGV or die "No input file specified";
open my $first,  '<', $ARGV[0] or die "Unable to open input file: $!";
open my $second, '<', $ARGV[1] or die "Unable to open input file: $!";
print scalar <$first>;    # print the header line of file 1
while (<$second>) {
    chomp;
    @line = split /\s+/;
    $hash{$line[2]} = $line[3];
}
while (<$first>) {
    @cols = split /\s+/;
    $p1 = $cols[4];
    $p2 = $cols[5];
    foreach $key (sort keys %hash) {
        if ($p1 >= $key) {
            if ($p2 <= $hash{$key}) {
                print join("\t", @cols), "\n";
            }
        }
        else { next; }
    }
}
But this also takes a lot of time and memory. Can anybody suggest how I can make it fast using hashes? Thanks a lot.
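For what it's worth, one standard way to beat the per-row scan (sketched in Python; helper names are mine): sort file 2's ranges by start and keep a running maximum of the range ends. For a query (p1, p2), binary-search for the last start <= p1; if the running max end up to that index is >= p2, then some range both starts at or before p1 and ends at or after p2, so the row matches. Each lookup is then O(log n) after a one-off O(n log n) sort:

```python
from bisect import bisect_right

def build_index(ranges):
    """Sort ranges by start; prefix_max_end[i] = max end among ranges[0..i]."""
    ranges = sorted(ranges)
    starts = [s for s, _ in ranges]
    prefix_max_end = []
    best = float("-inf")
    for _, e in ranges:
        best = max(best, e)
        prefix_max_end.append(best)
    return starts, prefix_max_end

def is_contained(p1, p2, starts, prefix_max_end):
    """True if some file-2 range (s, e) has s <= p1 and e >= p2."""
    i = bisect_right(starts, p1) - 1   # last range starting at or before p1
    return i >= 0 and prefix_max_end[i] >= p2

starts, pme = build_index([(262, 279), (118, 119), (183, 185), (124, 134)])
print(is_contained(263, 278, starts, pme))  # True  (row 101_#2)
print(is_contained(526, 581, starts, pme))  # False (row 103_#1 V1)
```

This still assumes file 2's ranges fit in memory; unlike the hash attempt above it also handles duplicate start values, which a hash keyed on the start column silently overwrites.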