
What approach should I take if I am trying to read multiple large files and join them using a key? Reading one line at a time from each file works for my simple scenario, but there is a possibility of 1-to-many combinations. Looking for some guidance. Thanks!

use strict;
use warnings;

open my $head,  '<', $ARGV[0] or die "Can't open $ARGV[0] for reading: $!";
open my $addr,  '<', $ARGV[1] or die "Can't open $ARGV[1] for reading: $!";
open my $phone, '<', $ARGV[2] or die "Can't open $ARGV[2] for reading: $!";
#open my $final, '<', $ARGV[3] or die "Can't open $ARGV[3] for reading: $!";

while ( my $line1 = <$head> and my $line2 = <$addr> and my $line3 = <$phone> ) {
    chomp( $line1, $line2, $line3 );

    # split each line into pipe-delimited fields
    my @headValues  = split /\|/, $line1;
    my @addrValues  = split /\|/, $line2;
    my @phoneValues = split /\|/, $line3;

    # if the key matches across all three files, join them
    if ( $headValues[0] eq $addrValues[0] and $headValues[0] eq $phoneValues[0] ) {
        print "$headValues[0]|$headValues[1]|$headValues[2]|$addrValues[1]|$addrValues[2]|$phoneValues[1]\n";
    }
}

close $head;
close $addr;
close $phone;
    how large is large? how many bytes, how many lines, in each file? which files, if any, can have more than one line with the same key value? – ysth Nov 05 '13 at 15:21
  • Friendly hint: `use autodie;`, and then you don't have to do the `open or die` stuff. – David W. Nov 05 '13 at 15:29
  • Millions of lines, and files up to 16 GB. head.txt is unique. All but head.txt can have more than one line with the same key. – user1090708 Nov 05 '13 at 15:29
  • maybe a bit more information about the file contents would be useful – Davs Nov 05 '13 at 15:47
  • (1) Are the files ordered or clustered on the 1st column? (2) Is the order of rows important? Or may some rows be re-ordered? (3) How large is the head file compared with the other files? (4) How often will this code run? Once, or daily/weekly? (5) Are there specific hardware constraints, e.g. little memory, hard disk drives? (6) It might be helpful if you could express your intent as a SQL query. – amon Nov 05 '13 at 15:49
  • They are sorted by key, which is not always in the first column. The order of rows is not important, as long as the right key is presented for each row, e.g. 123|recordA_firstName1 123|recordA_firstName2 456|recordB_firstName1 (there can be more than one attribute per key). This will run once; no hardware constraints. Cannot use SQL at this point. – user1090708 Nov 05 '13 at 16:04

3 Answers


I am not sure if it's exactly what you're looking for, but did you try the UNIX command join? Consider these two files:

x.tsv

001 X1
002 X2
004 X4

y.tsv

002 Y2
003 Y3
004 Y4

The command join x.tsv y.tsv produces:

002 X2 Y2
004 X4 Y4

That is, it merges lines with the same ID and discards the others (to keep things simple).
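In the OP's case there are three pipe-delimited files, so it would take two passes; a minimal sketch, assuming the key is the first field of every file (join's -1 and -2 options select other fields), that the files are sorted on that key, and with addr.txt and phone.txt as stand-in names for the second and third files. Note that join emits one output line per matching pair, so it copes with the 1-to-many case:

join -t '|' head.txt addr.txt | join -t '|' - phone.txt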

Pierre

Trying to understand your files: you have one file of head values (whatever those are), one file filled with phone numbers, and one file filled with addresses. Is that correct? Each file can have multiple heads, addresses, or phone numbers, and the files somehow correspond to one another.

Could you give an example of the data in the files, and how they relate to each other? I'll update my answer as soon as I get a better understanding of what your data actually looks like.

Meanwhile, it's time to learn about references. References allow you to create more complex data structures. And, once you understand references, you can move on to object-oriented Perl, which will really allow you to tackle programming tasks that you didn't know were possible.

Perl references allow you to have hashes of hashes, arrays of arrays, arrays of hashes, or hashes of arrays, and of course the arrays or hashes inside them can themselves contain arrays or hashes. Maybe an example will help.

Let's say you have a hash of people keyed by employee number. I'm assuming that your first file is employee_id|name, the second file is address|city_state, and the third is home_phone|work_phone:

First, just read in the files into arrays:

use strict;
use warnings;
use autodie;
use feature qw(say);

my ( $file1, $file2, $file3 ) = @ARGV;

open my $heading_fh, "<", $file1;
open my $address_fh, "<", $file2;
open my $phone_fh,   "<", $file3;

my @headings = <$heading_fh>;
chomp @headings;
close $heading_fh;

my @addresses = <$address_fh>;
chomp @addresses;
close $address_fh;

my @phones = <$phone_fh>;
chomp @phones;
close $phone_fh;

That'll make it easier to manipulate the various data streams. Now, we can go through each row:

my %employees;
for my $employee_number (0..$#headings) {
    my ( $employee_id, $employee_name ) = split /\s*\|\s*/, $headings[$employee_number];
    my ( $address, $city )              = split /\s*\|\s*/, $addresses[$employee_number];
    my ( $home_phone, $work_phone )     = split /\s*\|\s*/, $phones[$employee_number];
    $employees{$employee_id}->{NAME}    = $employee_name;
    $employees{$employee_id}->{ADDRESS} = $address;
    $employees{$employee_id}->{CITY}    = $city;
    $employees{$employee_id}->{WORK}    = $work_phone;
    $employees{$employee_id}->{HOME}    = $home_phone;
}

Now, you have a single hash called %employees that is keyed by the $employee_id, and each entry in the hash is a reference to another hash. You have a hash of hashes.

The end result is a single data structure (your %employees) that is keyed by $employee_id, but each field is individually accessible. What is the name of employee number A103? It's $employees{A103}->{NAME}.
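From there, reproducing the pipe-joined output of the original script takes one more loop; a minimal sketch (the field order is my guess at what is wanted):

for my $id ( sort keys %employees ) {
    my $emp = $employees{$id};
    # one pipe-delimited record per employee
    say join '|', $id, $emp->{NAME}, $emp->{ADDRESS}, $emp->{CITY}, $emp->{HOME}, $emp->{WORK};
}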

The code is far from complete. For example, you probably want to verify that all of your initial arrays are the same size, and die if they're not:

if ( $#headings != $#phones or $#headings != $#addresses ) {
    die qq(The files don't have the same number of entries);
}

I hope the idea of using references and making use of more complex data structures makes things easier to handle. However, if you need more help, post an example of what your data looks like, and explain what the various fields are and how they relate to each other.

Many postings on Stack Overflow look like this to me:

My data looks like this:

ajdjadd|oieuqweoqwe|qwoeqwe|(asdad|asdads)|adsadsnrrd|hqweqwe

And, I need to make it look like this:

@#*()#&&###|@#*@#&)(*&!@!|@#@#&(*&@#
David W.

If I were you, I would build an SQLite database from the three files; then it would be much easier to use SQL to retrieve the results.

I don't know how fast it is going to be, but I think it is more robust than reading three files in parallel. SQLite can handle this amount of data.

http://perlmaven.com/simple-database-access-using-perl-dbi-and-sql

SQLite for large data sets?

#!/usr/bin/perl
use strict;
use warnings;

use DBI;

my $dbfile = "sample.db";

my $dsn = "dbi:SQLite:dbname=$dbfile";
my $user = "";
my $password = "";
my $dbh = DBI->connect( $dsn, $user, $password, {
    PrintError       => 1,
    RaiseError       => 1,
    FetchHashKeyName => 'NAME_lc',
    AutoCommit       => 0,
} );
$dbh->do('PRAGMA synchronous = OFF');

### create one table per input file
for my $table (qw(t1 t2 t3)) {
    $dbh->do(<<"END_SQL");
CREATE TABLE $table (
    id INTEGER PRIMARY KEY,
    c1 VARCHAR(100),
    c2 VARCHAR(100),
    c3 VARCHAR(100),
    c4 VARCHAR(100)
)
END_SQL
}
### populate data
### populate data: one prepared INSERT per table, executed once per line
my @tables = qw(t1 t2 t3);
for my $i ( 0 .. $#tables ) {
    open my $fh, '<', $ARGV[$i] or die "Can't open $ARGV[$i] for reading: $!";
    my $ins = $dbh->prepare("INSERT INTO $tables[$i] (c1, c2, c3, c4) VALUES (?, ?, ?, ?)");
    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\|/, $line;
        $ins->execute( @cols[0 .. 3] );
    }
    close $fh;
    $dbh->commit();
}
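### an addition of mine: with millions of rows per table, the three-way join
### below needs indexes on the key column, and building them after the bulk
### load is usually faster than before it (the index names are my own)
for my $table (qw(t1 t2 t3)) {
    $dbh->do("CREATE INDEX idx_${table}_c1 ON $table (c1)");
}
$dbh->commit();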
### process data
my $sql = 'SELECT t1.c1, t1.c2, t1.c3, t2.c2, t2.c3, t3.c2 FROM t1,t2,t3 WHERE t1.c1=t2.c1 AND t1.c1=t3.c1 ORDER BY t1.c1';
my $sth = $dbh->prepare($sql);
$sth->execute();
while (my @row = $sth->fetchrow_array) {
    print join("\t",@row)."\n";
}

$dbh->disconnect;
#unlink($dbfile);
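Assuming the script is saved as join_files.pl (that name, and the second and third file names below, are placeholders), it would be run with the three files as arguments:

perl join_files.pl head.txt addr.txt phone.txt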
user1126070
  • Thanks! i will give this a shot. Wonder what the performance will be as well. – user1090708 Nov 05 '13 at 16:51
  • I am trying to test it with Cygwin and get this error: install_driver(SQLite) failed: Can't locate DBD/SQLite.pm in @INC (@INC contains: /usr/lib/perl5/site_perl/5.14/x86_64-cygwin-threads /usr/lib/perl5/site_perl/5.14 /usr/lib/perl5/vendor_perl/5.14/x86_64-cygwin-threads /usr/lib/perl5/vendor_perl/5.14 /usr/lib/perl5/5.14/x86_64-cygwin-threads /usr/lib/perl5/5.14 .) at (eval 3) line 3.Perhaps the DBD::SQLite perl module hasn't been fully installed,or perhaps the capitalisation of 'SQLite' isn't right.Available drivers: DBM, ExampleP, File, Gofer, Proxy, Sponge, mysql. at useDB.pl line 12 – user1090708 Nov 05 '13 at 17:47
  • Found my answer. Needed to install all the missing components and install: – user1090708 Nov 05 '13 at 19:58
  • This gets you to the CPAN prompt: `perl -MCPAN -e shell`. Once you are at the CPAN prompt: `cpan> install DBD::SQLite` – user1090708 Nov 05 '13 at 19:59