
I have a big pipe-delimited input file, approx. 6 million lines, like the sample below:

24|BBG000SJFVB0|EQ0000000009296012|OI SA-ADR|OIBR/C|US|ADR|Equity
16|BBG002PHVB83|EQ0000000022353186|BLOOM SELECT INCOME FUND|BLB-U|CT|Closed-End Fund|Equity
-50|BBG000V0TN75|EQ0000000010271114|MECHEL-PREF SPON ADR|MTL/P|US|ADR|Equity
20|BBG002S0ZR60|EQ0000000022739316|DIVIDEND 15 SPLIT CORP II-RT|DF-R|CT|Closed-End Fund|Equity
-20|BBG001R3LGM8|EQ0000000017879513|ING FLOATING RATE SENIOR LOA|ISL/U|CT|Closed-End Fund|Equity
0|BBG006M6SXL2|EQ0000000006846232|AA PLC|AA/|LN|Common Stock|Equity

Requirements are as below:
1. I need to sort the file by the 1st column, then the 2nd column, then the 2nd-to-last column, in that order.
2. Display the % of sort completion in the terminal/console, e.g. "column 2 75% sort done".
3. Finally, write the output to a separate file.

I have written the program below, which sorts by the 1st column perfectly. But how do I incorporate all the other conditions? It is also taking quite a while to run now. Is there a more efficient and cleaner way to do it? The only restriction is that we can't use any additional outside packages from CPAN. Unix solutions using sed/awk are OK, but Perl is preferable. I have also just learned that Python is available on the system, so a Python solution is welcome too.

my %link_strength;
my $data = "datascope_input.txt";
my $out  = "sort_file.txt";

open(my $indata,  '<', $data) || die "could not open $data :\n$!";
open(my $outdata, '>', $out)  || die "could not open $out :\n$!";
select $outdata;    # send subsequent prints to the output file

my @array = <$indata>;
for (@array) {
    # capture the first pipe-delimited field as the sort key
    $link_strength{$1} = $_ if /(?:[^|]+\|){0}([^|]+)/;
}
print $link_strength{$_} for sort { $a <=> $b } keys %link_strength;

close($outdata);
close($indata);
pmr
  • A system sort like this http://man7.org/linux/man-pages/man1/sort.1.html is much better optimized for data of this size than reading the whole set into a perl array. With the right options, it will solve your problem neatly except that there is no clean way to get a percent complete indicator either in perl or unix sort. – Gene Jun 20 '15 at 04:23
  • @Gene, I am not looking for a clean way... just any way to code up a % complete indicator. – pmr Jun 20 '15 at 04:36
  • How do you define, and then measure, percentage complete? Using a built-in sort function is likely to give you problems in measuring the completeness of the job. – Jonathan Leffler Jun 20 '15 at 05:01
  • @JonathanLeffler I found a script here: http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file that sets MAX_LINES_PER_CHUNK=1000000 ORIGINAL_FILE=$1 SORTED_FILE=$2 CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split. SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted so based on those CHUNKS, is something possible? – pmr Jun 20 '15 at 05:34
  • I'm not sure what you're looking for. I suspect not, even so. Monitoring progress is not easy. – Jonathan Leffler Jun 20 '15 at 05:45

2 Answers


As I said in comments, the Linux/Unix system sort is likely to perform better, but if you really want Perl, this will do the trick:

use strict;
use warnings;

sub main {
  open my $in, '<', 'input.txt' or die $!;
  my @pairs;
  while (<$in>) {
    my @fields = split /\|/;
    # key = [1st column, 2nd column, 2nd-to-last column]
    my $key = [ @fields[0, 1, -2] ];
    push @pairs, [$key, $_];
  }
  close $in;
  my @sorted_pairs = sort {
    my $a_key = $a->[0];
    my $b_key = $b->[0];
    $a_key->[0] <=> $b_key->[0]          # 1st column, numeric
      || $a_key->[1] cmp $b_key->[1]     # 2nd column, string
      || $a_key->[2] cmp $b_key->[2]     # 2nd-to-last column, string
  } @pairs;
  foreach my $pair (@sorted_pairs) {
    print $pair->[1];
  }
}

main();
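The script prints the sorted lines to standard output, so to satisfy requirement 3 just redirect it into the output file, e.g. perl sort.pl > sort_file.txt (sort.pl being whatever you name the script).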

Also as I said in comments, I know of no way to introspectively gather progress information. You could hack something by counting how many comparisons have occurred, but since you'll never be sure of the final number, a percent complete can't be calculated.
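If you want to attempt that hack anyway, here is a rough sketch of the idea (my own illustration, not a tested solution): it counts calls to the comparator and guesses the total as n*log2(n). That guess is only an estimate, so the reported percentage is approximate and may never quite reach 100%.

use strict;
use warnings;

# Build [1st field, 2nd field, 2nd-to-last field, original line] records.
my @data = map { [ (split /\|/)[0, 1, -2], $_ ] } <STDIN>;

my $n        = @data;
my $estimate = $n * log($n) / log(2);   # rough guess at total comparisons
my ($count, $last_pct) = (0, -1);

my @sorted = sort {
    $count++;
    my $pct = int(100 * $count / $estimate);
    if ($pct > $last_pct && $pct <= 100) {
        print STDERR "sort $pct% done\n";   # progress on STDERR, data on STDOUT
        $last_pct = $pct;
    }
    $a->[0] <=> $b->[0] || $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2]
} @data;

print $_->[3] for @sorted;

Run it as perl progress_sort.pl < datascope_input.txt > sort_file.txt; the file names are the ones from the question, the script name is just a placeholder.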

Gene
  • @Gene, very good. I am just curious about "counting how many comparisons have occurred, but since you'll never be sure of the final number"... if we can count them, why are we not sure about the number? – pmr Jun 20 '15 at 05:13
  • @purnendumaity: You can know how many comparisons have been made; you might be able to make a guess at how many comparisons will be made, but for all except the simplest (and most inefficient) sorts, that will probably be an upper-bound. And in any case, unless you've written the sort to be instrumented, you won't be able to determine how many comparisons it expects to make. – Jonathan Leffler Jun 20 '15 at 05:37

From your sample data, you are going to sort approximately 950MB. Reading that from a normal HD (100MB/s) will take about 9.5s. I do not know exactly how fast the standard sort will be, but in my experience it can handle 1-3 million records per second per CPU core. Say 1 million: 6 million records take about 3s on a dual core, and less on a server with more CPU cores. Most of the time will go into reading and parsing your data. So something as simple as

pv -p your_file.dat | sort -t'|' -k1,1n -k2,2d -k7,7 > sort_file.txt

should cover most of the required functionality.
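Here pv -p draws a progress bar as the file streams into sort, which is about as close to requirement 2 as a shell pipeline gets; -t'|' makes sort split fields on pipes; -k1,1n sorts the 1st field numerically, -k2,2d the 2nd field in dictionary order, and -k7,7 the 2nd-to-last of the 8 columns in the sample (adjust that field number if the real file has more columns). The final redirect writes the result to a separate file, per requirement 3.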

Hynek -Pichi- Vychodil