
I need to merge two files into a new file.

The two files each have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The second file may have records that the first file does not.

Sample File 1:

1001234|X15X1211,J,S,12,15,100.05

Sample File 2:

1231112|AJ32,,,18,JP     
1001234|AJ15,,,16,PP

Output:

1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP

I am using the following piece of code:

use POSIX qw(strftime);
use Tie::File::AsHash;

# Tie the hash to the file on disk, splitting each line on the pipe.
tie my %hash_REP, 'Tie::File::AsHash', 'rep.in', split => '\|'
    or die "Problem tying hash: $!";

my $counter = 0;
while (my ($key, $val) = each %hash_REP) {
    print strftime("%a %b %e %H:%M:%S %Y", localtime), "\n" if $counter == 0;
    $counter++;
}

It takes almost 1 hour to prepare the associative array. Is that really good or really bad? Is there a faster way to handle this many records in an associative array? Any suggestion in any scripting language would really help.

Thanks, Nitin T.

I also tried the following program, which also took over an hour:

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my $now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";

# Build the lookup hash from the first file.
my %hash;
open my $app, '<', 'APP.in' or die $!;
while (my $line = <$app>) {
    chomp $line;
    my ($key, $val) = split /\|/, $line;
    $hash{$key} = $val;
}
close $app;

# Walk the second file and append the matching value from the hash.
my $filename = 'report.txt';
open my $fh, '>', $filename or die "Could not open file '$filename' $!";
open my $rep, '<', 'rep.in' or die $!;
while (my $line = <$rep>) {
    chomp $line;
    my @words = split /\|/, $line;
    print $fh $words[$_] . "|^" for 1 .. $#words;
    print $fh $hash{$words[0]} . "\n";
}
close $rep;
close $fh;
print "done\n";

$now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";
  • Please do not tag languages that have nothing to do with your question. If you are asking for this Perl code to be translated into Python, that is off-topic for Stack Overflow. – Cory Kramer Mar 02 '17 at 17:37
  • wanted suggestion in any (perl/python) scripting language – Nitin Tripathi Mar 02 '17 at 17:38
  • 3
    This review echoes your poor performance results: http://cpanratings.perl.org/dist/Tie-File-AsHash – toolic Mar 02 '17 at 17:40
  • @CoryKramer, it's just a common problem in Perl and Python of handling a huge associative array, so an optimized approach would be helpful. – Nitin Tripathi Mar 02 '17 at 17:41
  • 2
    Do you insist on tie-ing it? Are there particular reasons for that? – zdim Mar 02 '17 at 17:41
  • @zdim: No, I was browsing for an optimal approach and found this one. Though I tried a similar approach of opening the file, reading it line by line, and inserting elements into an associative array, it also took almost 1 hour to process. :( – Nitin Tripathi Mar 02 '17 at 17:44
  • 2
    The fact you're iterating your hash using `each` suggests to me that you don't actually need to use an on-disk data structure in the first place. What problem are you _actually_ trying to solve here? – Sobrique Mar 02 '17 at 17:44
  • 1
    @Nitin Tripathi, Anything providing a tie-based interface will inherently be slower than if it didn't. – ikegami Mar 02 '17 at 17:46
  • @Sobrique: I have two files with over 300 million records each, with the first column as primary key, in random order. Now I have to merge the data into a new file. The 2nd file can have some extra records. – Nitin Tripathi Mar 02 '17 at 17:47
  • 1
    @NitinTripathi That is indeed _a lot_ to keep in RAM -- do you have to have it all at once? If you are going to work with it more than once, perhaps feed it to a database first? – zdim Mar 02 '17 at 17:49
  • My basic question would be: is 1 hour to process 300 million records really good or really bad? – Nitin Tripathi Mar 02 '17 at 17:50
  • 1
    @NitinTripathi It's really bad. Take an average disk read speed to be 100MB/s. You seem to have some 20GB there, so it'd be 200 sec to read that. Three minutes or so – zdim Mar 02 '17 at 17:53
  • 4
    I would import both files into an sqlite and let it deal with it. Let it be on disk, let it take some time, but don't try to load everything into memory. – simbabque Mar 02 '17 at 18:05
  • @zdim: The data is extracted from a DB; the final extract that is needed has to be joined over 15 tables with some inner queries, and the maximum parallelism that can be used is only 8, which takes almost 1.5 hours to process the data. Oracle cannot be locked for 1.5 hours in a production env, so we decided to go with file handling. – Nitin Tripathi Mar 02 '17 at 18:07
  • 1
    Can you [edit] your question and show how you want to combine them? Include a few lines of both files and show what the result should be. – simbabque Mar 02 '17 at 18:08
  • 1
    Could you add an example of a row from each file, and the row they would produce when merged. Also, could you mention if there's an upper limit on the size of the fields? – ikegami Mar 02 '17 at 18:12
  • @ikegami: Added example – Nitin Tripathi Mar 02 '17 at 18:25
  • And `1231112` from the _Sample File 2_ is not part of the output? – simbabque Mar 02 '17 at 18:31
  • *"with first column as primary key"* Text files don't have keys, primary or otherwise. – Borodin Mar 02 '17 at 19:31
  • How many different values for the first columns are there? – Borodin Mar 02 '17 at 19:31
  • @Borodin: First Column will be a unique value. – Nitin Tripathi Mar 02 '17 at 23:12
  • @simbabque: Since File1 is the driving file, only records that match the first file will be printed. – Nitin Tripathi Mar 02 '17 at 23:14

2 Answers

6

Your technique is extremely inefficient for a few reasons.

  • Tying is extremely slow.
  • You're pulling everything into memory.

The first can be mitigated by doing the reading and splitting yourself, but the latter is always going to be a problem. The rule of thumb is to avoid pulling big hunks of data into memory. It'll hog all the memory and probably cause it to swap to disk and slow waaaay down, especially if you're using a spinning disk.

Instead, there are various "on-disk hashes" you can use with modules like GDBM_File or BerkeleyDB.
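
For illustration, a minimal sketch of the on-disk hash route with GDBM_File might look like the following (the database filename, input filename, and mode bits here are assumptions, not part of the question):

#!/usr/bin/perl
use strict;
use warnings;
use GDBM_File;

# Tie the hash to an on-disk GDBM database instead of keeping it in RAM.
# 'app.gdbm' is a hypothetical output filename.
tie my %hash, 'GDBM_File', 'app.gdbm', &GDBM_WRCREAT, 0640
    or die "Cannot tie app.gdbm: $!";

open my $fh, '<', 'APP.in' or die "Cannot open APP.in: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $val) = split /\|/, $line, 2;
    $hash{$key} = $val;    # stored on disk, not in memory
}
close $fh;

untie %hash;

Lookups against %hash then hit the GDBM file rather than RAM, trading some speed for a bounded memory footprint.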

But really there's no reason to mess around with them because we have SQLite and it does everything they do faster and better.


Create a table in SQLite.

create table imported (
    id integer,
    value text
);

Import your file using the sqlite shell's `.import`, adjusting for your format using `.mode` and `.separator`.

sqlite>     create table imported (
   ...>         id integer,
   ...>         value text
   ...>     );
sqlite> .mode list
sqlite> .separator |
sqlite> .import test.data imported
sqlite> .mode column
sqlite> select * from imported;
12345       NITIN     
12346       NITINfoo  
2398        bar       
9823        baz     

And now you, and anyone else who has to work with the data, can do whatever you like with it in efficient, flexible SQL. Even if it takes a while to import, you can go do something else while it does.
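
As a rough sketch of that kind of follow-up query driven from Perl (this assumes DBD::SQLite is installed, that the two inputs were imported into tables named file1 and file2, and that the database file is called merged.db; all of these names are chosen here for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the SQLite database that the .import commands populated.
my $dbh = DBI->connect('dbi:SQLite:dbname=merged.db', '', '',
                       { RaiseError => 1 });

# Join the two imported tables on the key column.
my $sth = $dbh->prepare(q{
    SELECT file1.id, file1.value, file2.value
      FROM file1
      JOIN file2 ON file2.id = file1.id
});
$sth->execute;

# Write the merged rows, e.g. 1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP
open my $out, '>', 'report.txt' or die "Could not open report.txt: $!";
while (my ($id, $v1, $v2) = $sth->fetchrow_array) {
    print $out "$id,$v1,$v2\n";
}
close $out;
$dbh->disconnect;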

Schwern
  • 2
    Re "*You're pulling everything into memory.*", No, the whole point of Tie::File::AsHash is that it *doesn't* pull everything into memory. It's a thin wrapper for Tie::File. – ikegami Mar 02 '17 at 19:50
  • 2
    Re "*Tying is extremely slow*", That's not the problem either. The problem is that fetching from a Tie::File::AsHash hash peforms a linear scan of a Tie::File array, which you means you read half the file (on average) every time you lookup a hash element!!! – ikegami Mar 02 '17 at 19:58
  • 2
    For me, this took twice as long as using `sort` plus merging with Perl. See my answer. – ikegami Mar 02 '17 at 19:58
  • @ikegami After they've read them from the file, they're storing all the keys and values in memory. And yes, I'm sure there's other performance issues. Anyhow, the wallclock performance doesn't really matter. This technique requires very little coding or debugging, so you can set it and walk away, and it only has to be done once. Once it's done, you've got it in a very useful and efficient format for further manipulation. [I have an answer that lays out the CSV vs SQLite trade offs](http://stackoverflow.com/questions/40695905/sqlite-vs-csv-file-manipulation-performance/40696615#40696615). – Schwern Mar 02 '17 at 20:02
  • @ikegami Oh, I see, "twice as long" means 30 seconds vs a 1 minute. How long did it take you to write the code? :P – Schwern Mar 02 '17 at 20:04
  • 1
    Re "*After they've read them from the file, they're storing all the keys and values in memory*", No. Again, the whole point of using Tie::File::AsHash is that it doesn't do this. – ikegami Mar 02 '17 at 20:23
  • Re "*"twice as long" means 30 seconds vs a 1 minute.*", Actually, if we extrapolate for 300 million rows, it's 15 minutes vs 30 minutes. /// Re "*How long did it take you to write the code?*", Far less than writing the `sqlite3` script, actually, but it doesn't matter. This is obviously something that will run regularly, or the fact that it previously *completed* in three hours wouldn't matter. – ikegami Mar 02 '17 at 20:23
6

I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.

perl -e'
   sub get {
      my $fh = shift;
      my $line = <$fh>;
      return () if !defined($line);

      chomp($line);
      return split(/\|/, $line);
   }

   sub main {
      @ARGV == 2
         or die("usage\n");

      open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0]);
      open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1]);

      my ($key1, $val1) = get($fh1)  or return;
      my ($key2, $val2) = get($fh2)  or return;

      while (1) {
         if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
         elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
         else {
            print("$key1,$val1,$val2\n");
            ($key1, $val1) = get($fh1)  or return;
            ($key2, $val2) = get($fh2)  or return;
         }
      }
   }

   main();
' file1 file2 >file

For 10,000,000 records in each file, this took 37 seconds on a slowish machine.

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2

$ time perl -e'...' file1 file2 >file
real    0m37.030s
user    0m38.261s
sys     0m1.750s

Alternatively, one could dump the data into a database and let it handle the details.

sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || "," || file1.value || "," || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI

But you pay for the flexibility. This took twice as long.

real    1m14.065s
user    1m11.009s
sys     0m2.550s

Note: I originally had `CREATE INDEX file2_id ON file2 ( id );` after the `.import` commands, but removing it greatly helped performance.

ikegami
  • For `10M` records it's really efficient, even though it took `1.30 min` to process. But when I went further to `300M` records it crashed: the temp space on my machine was `5GB` while the actual file was `10GB+`, and since the sorting takes place in temp space, it crashed. After removing the sorting part and keeping only the merging part (provided the files are already sorted), it took `42 Min` to merge, keeping in mind that out of `32GB` only `2GB` was available and it ran at `50% CPU Usage`. After killing all other processes, it took 24 Min at `93% CPU Usage`. It seems that at 93% CPU usage it's still slow!! – Nitin Tripathi Mar 08 '17 at 08:54
  • I also tried splitting the file into `30M` records each, and each chunk took `2 Minutes`. But when I ran all `10` processes in parallel, it took `12 Minutes` cumulatively, even though each process individually took `2 Min`. – Nitin Tripathi Mar 08 '17 at 08:58