1

I have two text files that look like:

col1 primary col3 col4
blah 1       blah  4
1    2       5     6
...

and

colA primary colC colD
1    1       7    27
foo  2       11   13

I want to merge them into a single wider table, such as:

primary  col1 col3 col4 colA colC colD
1        blah blah 4    a    7    27
2        1    5    6    foo  11   13

I'm pretty new to Perl, so I'm not sure what the best way is to do this. Note that column order does not matter, and there are a couple million rows. Also my files are unfortunately not sorted.

My current plan unless there's an alternative: For a given line in one of the files, scan the other file for the matching row and append them both as necessary into the new file. This sounds slow and cumbersome though.

Thanks!

AlexQueue
  • 6,353
  • 5
  • 35
  • 44

1 Answers1

-1
  • Solution 1.

    1. Read the smaller of two files line by line, using a standard CPAN delimited-file parser like TXT::CSV_XS to parse out columns.

    2. Save each record (as arrayref of columns) in a hash, with your merge column being the hash key

    3. When done, read the larger of two files line by line, using a standard CPAN delimited-file parser like TXT::CSV_XS to parse out columns.

    4. For each record, find the join key field, find the matching record from your hash storing the data from file#1, merge the 2 records as needed, and print.

    NOTE: This is pretty memory intensive as the entire smaller file will live in memory, but won't require you to read one of the files million times.


  • Solution 2.

    1. Sort file1 (using Unix sort or some simple Perl code) into "file1.sorted"

    2. Sort file2 (using Unix sort or some simple Perl code) into "file2.sorted"

    3. Open both files for reading. Loop until both are fully read:

      • Read 1 line from each file into the buffer if the buffer for that file is empty (buffer being simply a variable containing the next record).

      • Compare indexes between 2 lines.

      • If index1 < index2, write the record for file1 into output (without merging) and empty buffer1. Repeat step 3

      • If index1 > index2, write the record for file2 into output (without merging) and empty buffer2. Repeat.

      • If index1 == index2, merge 2 records, write the merged record into output and empty out both buffers (assuming the join index column is unique. If not unique, this step is more complicated).

    NOTE: this does NOT require you to keep entire file in memory, aside from sorting the files (which CAN be done in memory constrained way if you need to).

DVK
  • 126,886
  • 32
  • 213
  • 327