-3

I need to compare two huge CSV files with thousands of entries each, like below:

id;val

1;a
2;b 
3;c

And the second file has the following structure:

id1;entry    
1;002
2;x90 
5;d07

The desired result is to match and combine the lines with the same value for id/id1 and create a third CSV file containing only the matched entries, as shown below:

idR;valR;entryR
1;a;002
2;b;x90 

To accomplish this I can load each file into a separate database table and run a select like this to retrieve all matched values:

select tb1.id, tb1.val, tb2.entry
  from tb1, tb2
 where tb1.id = tb2.id1

With this approach I can retrieve all the desired values at once.
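For instance, a rough sketch with the sqlite3 shell (the file names f1.csv and f2.csv are placeholders, and a reasonably recent sqlite3 is assumed, where .import creates the table and takes the column names from the header row):

$ sqlite3 :memory: <<'EOF'
.separator ";"
.import f1.csv tb1
.import f2.csv tb2
select tb1.id, tb1.val, tb2.entry
  from tb1, tb2
 where tb1.id = tb2.id1;
EOF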

But let's suppose these files are sorted; in that case it should be possible to use awk to print the results for the entries with the same values for id and id1. Is the best I can do to build two associative arrays, one per file, and perform a binary search using awk and sed/cut?

Is it possible to load these two files and combine them in one pass to produce a final CSV file with the results?

Or can I do this with Perl using only the standard library?

Bera
  • There's actually a standard utility called [`join`](http://linux.die.net/man/1/join) which can do this. – amdixon Jul 06 '15 at 11:55
  • Are the files small enough to load into memory? – PM 2Ring Jul 06 '15 at 12:11
  • You say the files are CSV but the examples you show are space-separated. Edit your question to be clear, consistent, and precise - if your files are space-separated stop talking about CSV and if they're comma separated then show that as your sample input/output. – Ed Morton Jul 06 '15 at 12:36

3 Answers

5

Load the information into memory with awk and then print the line if the id matches:

$ awk 'FNR==NR {a[$1]=$2; next} ($1 in a) {print $1, $2, a[$1]}' f2 f1
1 a 002
2 b x90

Explanation

The basic idea is to do things when reading file1 and other_things when reading file2:

awk 'FNR==NR {things; next} {other_things}' file1 file2

In our case, things is to store the content of f2 (the first file given on the command line) in memory, mapping every id to its value.

Then it goes through f1 and prints the content of each line, together with the mapped value, whenever there is a common id.
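If the real files are semicolon-separated and carry header lines, as in the question's samples, a variant along these lines should do it (a sketch, with f1 and f2 standing for the two files):

$ awk -F';' -v OFS=';' 'FNR==1 {next} FNR==NR {a[$1]=$2; next} ($1 in a) {print $1, $2, a[$1]}' f2 f1
1;a;002
2;b;x90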

fedorqui
  • Note I removed the headers from your files, since they don't seem to be important. If they are, let us know. – fedorqui Jul 06 '15 at 12:02
  • Note2: `awk` performs quite well in these kinds of problems. See [Remove duplicates from text file based on second text file](http://stackoverflow.com/q/30820894/1983854) and the time comparisons. – fedorqui Jul 06 '15 at 12:18
  • Note that this solution does not require the input files to be sorted while the `join` solution does. What the perl solution needs is a mystery :-). – Ed Morton Jul 06 '15 at 12:39
  • The Perl solution doesn't require much at all - but it does use more memory, because it splices the files while they are in memory. – Sobrique Jul 06 '15 at 12:42
  • @Sobrique wrt my comment about join and awk, the question is whether the perl solution requires the input files to be sorted or not. – Ed Morton Jul 06 '15 at 13:13
  • Nope. It sorts 'em itself. (This might be an undesired side effect of course) – Sobrique Jul 06 '15 at 13:13
  • FYI for output - the `awk` solution will print the `f2` lines that exist in `f1` in input order first and then the `f1` lines that don't exist in `f2` will be printed afterwards in random order. The `join` solution will print the lines in merged input order but since they need to be sorted first for `join` to work, that's no better in terms of output order than just sorting the `awk` output. – Ed Morton Jul 06 '15 at 13:16
  • Thanks guys so much for your help! The join solution also works on Solaris! – Bera Jul 08 '15 at 13:52
  • @Bera and `awk` also works in Solaris, only that you need to use `/usr/xpg4/bin/awk`. – fedorqui Jul 08 '15 at 13:57
  • Yes, I verified this after searching through the different shell tools available on Solaris. Thank you again so much! – Bera Jul 09 '15 at 11:11
4

You can do this with the standard join utility.

file1.txt

1 a
2 b
3 c

file2.txt

1 002
2 x90
5 d07

join example

join -1 1 -2 1 -o 1.1,1.2,2.2 file1.txt file2.txt

Here join is joining field 1 of file1 to field 1 of file2 (that is what -1 1 -2 1 means) and outputting the fields specified with the -o flag.

output

1 a 002
2 b x90
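Note that join requires both inputs to be sorted on the join field. If the real files are semicolon-separated and have header lines, as in the question's samples, something along these lines should work in bash (the file names are placeholders):

$ join -t ';' -1 1 -2 1 -o 1.1,1.2,2.2 \
      <(tail -n +2 f1.csv | sort -t ';' -k1,1) \
      <(tail -n +2 f2.csv | sort -t ';' -k1,1)
1;a;002
2;b;x90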
amdixon
2

Size is the difficult part, as merging the files may require reading the whole lot into memory.

However, for a general solution to the problem in Perl:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

# First pass: record every header seen and the order in which headers first appear
my %count_of;
my @field_order;

foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
}

print "Common headers:\n";
my @common_headers = grep { $count_of{$_} >= @ARGV } keys %count_of;
print join( "\n", @common_headers );

# Use the first common header as the join key, falling back to the line number
my %lookup_row;
my $key_field;
if (@common_headers) { $key_field = shift @common_headers }

# Second pass: index every row of every file by its key value
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {
        my $key = $.;
        if ($key_field) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

# Print the merged header row, then one combined row per key
my $csv_out = Text::CSV->new( { binary => 1 } );
my $header_row = \@field_order;
$csv_out->print( \*STDOUT, $header_row );
print "\n";

# Merge the per-file rows for each key and emit the combined row
foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
            if ( $lookup_row{$key}{$file}{$header} ) {
                if (   not defined $combined_row{$header}
                    or not $combined_row{$header} eq
                    $lookup_row{$key}{$file}{$header} )
                {
                    $combined_row{$header}
                        .= $lookup_row{$key}{$file}{$header};
                }
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}

Note that Text::CSV can print to a file handle rather than STDOUT, since writing to the terminal is probably not what you want for large files (or, y'know, just redirect with > output.csv).

You can also configure the delimiter for Text::CSV via sep_char:

my $csv = Text::CSV -> new ( { binary => 1, sep_char => "\t" } ); 

I was unclear what your separator was, so I have assumed comma (as you refer to CSV).

The script above will pick out a common field and merge on that, or on line number if none exists.

Note:

This script reads the files into memory and merges them there, sorting and joining on a common key; the output is sorted on that key too. It's therefore memory hungry, but should 'just work' in a lot of cases. Just specify the filenames: splice.pl file1.csv file2.csv file3.csv

If there is a common field in these files, it'll join on that and output in key order. If there isn't, it'll use the line number.
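For example, invoking the script with the output redirected to a file (merged.csv is just a placeholder name):

$ perl splice.pl file1.csv file2.csv > merged.csv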

Sobrique