
I have 2 CSV files with many fields (approx. 30 fields each) and a huge size (approx. 4 GB each).

File1:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"

File2:

EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"

I want to compare these 2 files and report the differences into another CSV file. In the above example, the records for employees Vivek and Karthick appear on different row numbers, but the record data is the same, so they should be considered matches. Employee Vinoth's record should be considered a mismatch, since the addresses differ.

The output diff.csv file can contain the mismatched records from File1 and File2, as below.

Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India" 
F2, Vinoth,12,2548.245,"140,North Street,USA"

I've written the code below so far. After this I'm confused about which option to choose: a binary search or some other efficient way to do this. Could you please help me?

My approach:
1. Load File2 into memory as a hash of hashes.
2. Read File1 line by line and match each record against the in-memory hash of hashes.

use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );

# Get Command Line Parameters

my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
  or die("Error in command line arguments\n");

if ( !defined $opts{'file1'} ) {
    die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
    die "CSV file --file2 not specified.\n";
}

my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';

print $file2 . "\n";

my $csv1 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
  Text::CSV_XS->new(
    { binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );

open( my $fh1, '<:encoding(utf8)', $file1 )
  or die "Cannot open '$file1': $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
  or die "Cannot open '$file2': $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
  or die "Cannot open '$file3': $!.\n";
binmode( STDOUT, ":utf8" );

my $reccount = 0;

# Load File2 into memory as a hash of hashes, keyed on EmployeeName.
my $hash_ref = hashify( $file2, 'EmployeeName' );

# Skip the header row of File1.
my $header1 = $csv1->getline($fh1);

while ( my $f1line = $csv1->getline($fh1) ) {
    ## Read each line from File1 and match it against File2, which is
    ## loaded in memory as a hash of hashes. Need help here.
}

print "End of Program" . "\n";
Vinoth Karthick
  • If you only need to dump the file diff into a new file, you can use the diff command ``` diff file2 file1 > file3 ```. You will get this: `< Vinoth,12,2548.245,"140,North Street,USA" < Karthick,10,10.245,"140,North Street,India" --- > Vinoth,12,2548.245,"140,North Street,India" 4a4 > Karthick,10,10.245,"140,North Street,India"` – Dragos Trif May 31 '20 at 12:10
  • 2
    @Dragos Trif, That's not going to work because the files are sorted differently. You should have realized this when you got different output than the one requested... (There would also be problems with using diff even if the files were sorted identically.) – ikegami May 31 '20 at 12:27
  • Re "*I'm confused which option to choose whether a Binary Search or any other efficient way to do this.*", Binary search requires sorted input, and you don't appear to have sorted input. (If you did, you could use the same approach used by a merge sort, requiring only one line from each file in memory at a time). If you can get the data sorted from the source, do that. Otherwise, consider using a database. – ikegami May 31 '20 at 12:31
  • 1
    I use [Beyond Compare](https://www.scootersoftware.com/) a lot. It handles CSV files and can do [exactly what you want](https://i.stack.imgur.com/Q2hua.png). I don't know how well it can handle files of the size in question. – ikegami May 31 '20 at 12:33
  • I need this program to be developed by own in perl – Vinoth Karthick May 31 '20 at 14:24
  • Like I said, use a database. It can even be a server-less DB. – ikegami May 31 '20 at 14:32
  • @VinothKarthick -- Comparing the files in memory would require reading both files into RAM (4 GB × 2 = 8 GB), which is not a very good approach. Comparing the files line by line requires them to be sorted, which would simplify the processing and speed up the comparison. Loading the data into a database and using SQL to compare the tables would be the optimal solution. You can use [SQLite](https://www.sqlite.org/index.html), or if you already use [MariaDB](https://mariadb.org/) it would not be a bad choice; [PostgreSQL](https://www.postgresql.org/) and [MySQL](https://www.mysql.com/) would also work. – Polar Bear May 31 '20 at 16:45
  • Since you _can_ read a whole file into memory, then why not read it into a hash (name being a key), and then go through the other file line by line doing a lookup for each name? (A hash-of-hashes you mention is slower to populate and takes way more memory but is faster and more convenient for comparisons later.) Once you have a file in a hash there is no need for any search. Or, if this is done repeatedly with similar data, by all means use a database (sqlite is very simple to use). – zdim May 31 '20 at 16:47
  • What about same names in one file, bound to happen in 4Gb of data? – zdim May 31 '20 at 16:51
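The hash-lookup approach suggested in the comments (and in the question's own plan) can be sketched in plain core Perl. This is a minimal sketch, not production code: it keys each record on everything before the first comma (assuming EmployeeName is unique and never quoted) and compares the rest of the line verbatim; a real version should parse with Text::CSV_XS. The `load_keyed` and `diff_files` names are illustrative, not from the question.

```perl
#!/usr/bin/env perl
# Sketch: load File2 into a hash keyed on the first field, then stream
# File1 line by line and look each record up. Only File2 is held in memory.
use strict;
use warnings;

sub load_keyed {
    my ($fh) = @_;
    my %by_name;
    my $header = <$fh>;    # skip header row
    while ( my $line = <$fh> ) {
        chomp $line;
        # Key on everything before the first comma; keep the rest verbatim.
        my ( $name, $rest ) = split /,/, $line, 2;
        $by_name{$name} = $rest;
    }
    return \%by_name;
}

sub diff_files {
    my ( $fh1, $file2_hash ) = @_;
    my @diffs;
    my $header = <$fh1>;    # skip header row
    while ( my $line = <$fh1> ) {
        chomp $line;
        my ( $name, $rest ) = split /,/, $line, 2;
        my $other = $file2_hash->{$name};
        if ( !defined $other ) {
            push @diffs, [ $name, $rest, undef ];    # only in File1
        }
        elsif ( $other ne $rest ) {
            push @diffs, [ $name, $rest, $other ];   # field mismatch
        }
    }
    return \@diffs;
}

# Demo on in-memory "files" (opening a filehandle on a string ref is core Perl).
my $f1 = <<'CSV';
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
CSV
my $f2 = <<'CSV';
EmployeeName,Age,Salary,Address
Vivek,40,2548.245,"140,North Street,India"
Vinoth,12,2548.245,"140,North Street,USA"
CSV

open my $fh2, '<', \$f2 or die $!;
my $hash2 = load_keyed($fh2);
open my $fh1, '<', \$f1 or die $!;
my $diffs = diff_files( $fh1, $hash2 );

for my $d (@$diffs) {
    print "F1,$d->[0],$d->[1]\n";
    print "F2,$d->[0],$d->[2]\n" if defined $d->[2];
}
```

As zdim notes above, this assumes names are unique within a file; duplicate keys would silently overwrite earlier records.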

2 Answers


Storing data of such magnitude in a database is the most correct approach to tasks of this kind. At minimum SQLite is recommended, but other databases (MariaDB, MySQL, PostgreSQL) will work quite well.
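A minimal sketch of that database approach, assuming the DBI and DBD::SQLite modules are installed (they are not used by the code below in this answer): each file is loaded into its own table, and SQL `EXCEPT` queries report the rows unique to each side. Table and column names here are illustrative.

```perl
#!/usr/bin/env perl
# Sketch: compare two record sets with SQLite set operations instead of
# Perl data structures. Demo rows stand in for the CSV contents.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 0 } );

for my $t (qw(f1 f2)) {
    $dbh->do("CREATE TABLE $t (name TEXT, age INT, salary REAL, address TEXT)");
}

# In a real program these rows would come from Text::CSV_XS->getline,
# inserted in batches inside a transaction for speed.
my $ins1 = $dbh->prepare('INSERT INTO f1 VALUES (?,?,?,?)');
my $ins2 = $dbh->prepare('INSERT INTO f2 VALUES (?,?,?,?)');
$ins1->execute(@$_) for
    [ 'Vinoth', 12, 2548.245, '140,North Street,India' ],
    [ 'Vivek',  40, 2548.245, '140,North Street,India' ];
$ins2->execute(@$_) for
    [ 'Vinoth', 12, 2548.245, '140,North Street,USA' ],
    [ 'Vivek',  40, 2548.245, '140,North Street,India' ];
$dbh->commit;

# Rows present in f1 but not f2, and vice versa.
my $only_f1 = $dbh->selectall_arrayref('SELECT * FROM f1 EXCEPT SELECT * FROM f2');
my $only_f2 = $dbh->selectall_arrayref('SELECT * FROM f2 EXCEPT SELECT * FROM f1');

print "F1,", join( ',', @$_ ), "\n" for @$only_f1;
print "F2,", join( ',', @$_ ), "\n" for @$only_f2;
```

With an on-disk database file instead of `:memory:`, neither CSV ever needs to fit in RAM, and an index on `name` makes keyed lookups cheap.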

The following code demonstrates how the desired output can be achieved without special modules, but it does not take into account possibly messed-up input data. The script will report data records as different even if the difference is just one extra space.

Output goes to the console window by default, unless you specify the output option.

NOTE: The whole of file #1 is read into memory; please be patient, processing big files can take a while.

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;    # required for the --debug dump below

my %opt;
my @args = (
            'file1|f1=s',
            'file2|f2=s',
            'output|o=s',
            'debug|d',
            'help|?',
            'man|m'
        );

GetOptions( \%opt, @args ) or pod2usage(2);

print Dumper(\%opt) if $opt{debug};

pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};

pod2usage(1) unless $opt{file1};
pod2usage(1) unless $opt{file2};

unlink $opt{output} if defined $opt{output} and -f $opt{output};

compare($opt{file1},$opt{file2});

sub compare {
    my $fname1 = shift;
    my $fname2 = shift;

    my $hfile1 = file2hash($fname1);

    open my $fh, '<:encoding(utf8)', $fname2
        or die "Couldn't open $fname2";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        my($key,$data) = ($1, $2);
        if( !defined $hfile1->{$key} ) {
            my $msg = "$fname1 $key is missing";
            say_msg($msg);
        } elsif( $data ne $hfile1->{$key} ) {
            my $msg = "$fname1 $key,$hfile1->{$key}\n$fname2 $_";
            say_msg($msg);
        }
    }
}

sub say_msg {
    my $msg = shift;

    if( $opt{output} ) {
        open my $fh, '>>:encoding(utf8)', $opt{output}
            or die "Couldn't open $opt{output}";

        say $fh $msg;

        close $fh;
    } else {
        say $msg;
    }
}

sub file2hash {
    my $fname = shift;
    my %hash;

    open my $fh, '<:encoding(utf8)', $fname
        or die "Couldn't open $fname";

    while(<$fh>) {
        chomp;
        next unless /^(.*?),(.*)$/;
        $hash{$1} = $2;

    }

    close $fh;

    return \%hash;
}

__END__

=head1 NAME

comp_csv - compares two CSV files and stores the differences

=head1 SYNOPSIS

 comp_csv.pl -f1 file1.csv -f2 file2.csv -o diff.txt

 Options:
    -f1,--file1 input CSV filename #1
    -f2,--file2 input CSV filename #2
    -o,--output output filename
    -d,--debug  output debug information
    -?,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-f1,--file1>

Input CSV filename #1

=item B<-f2,--file2>

Input CSV filename #2

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> accepts B<input> and processes it to B<output> with the purpose of achieving some goal.

=head1 EXIT STATUS

The section describes B<EXIT STATUS> codes of the program

=head1 ENVIRONMENT

The section describes B<ENVIRONMENT VARIABLES> utilized in the program

=head1 FILES

The section describes B<FILES> which used for program's configuration

=head1 EXAMPLES

The section demonstrates some B<EXAMPLES> of the code

=head1 REPORTING BUGS

The section provides information how to report bugs

=head1 AUTHOR

The section describing the author and his contact information

=head1 ACKNOWLEDGMENT

The section to give credits people in some way related to the code

=head1 SEE ALSO

The section describing related information - reference to other programs, blogs, website, ...

=head1 HISTORY

The section gives historical information related to the code of the program

=head1 COPYRIGHT

Copyright information related to the code

=cut

Output for the test files:

 file1.csv Vinoth,12,2548.245,"140,North Street,India"
 file2.csv Vinoth,12,2548.245,"140,North Street,USA"
Polar Bear
  • I cannot use any database, as we are not allowed to. Could this code be optimized by using CPAN modules for comparing huge files? – Vinoth Karthick May 31 '20 at 18:29
  • Try using BerkeleyDB or SQLite – Pradeep May 31 '20 at 19:22
  • 3
    This isn't a good solution to a situation with two 4GB files. You do not want to read that all into Perl. In general, programs that depend on marshaling all the data into in-memory data structures will eventually cause problems. Not only that, these answers would be much better if you removed all the extraneous bits and focused on the key concepts. Something like `say_msg` just wastes space and isn't even a good use of resources. Choose an output handle, default to STDOUT, and leave it at that. Don't reopen a file every time. But, don't include the distraction in the first place. – brian d foy May 31 '20 at 19:46
  • @briandfoy -- What would be the best optimized, high-performance approach for this problem in Perl? I cannot use a DB for this – Vinoth Karthick May 31 '20 at 21:05
  • 2
    "best", "optimized", "high performance" only have meaning in a context. I don't know anything about the requirements, resources, or the actual data files. There isn't one answer that's going to cover every situation. I started to type out some tips, but then decided to do something else. I think you should clarify current run times and memory use and what goal you have for those. – brian d foy May 31 '20 at 21:32
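If a database really is off the table, the sort-then-merge idea raised in the comments above gives constant memory use: sort both files externally first (for example with the system `sort` utility), then walk the two sorted streams in step, the way a merge sort does, holding only one line per file in memory. A minimal sketch, using pre-sorted headerless demo data in place of real files (the `merge_diff` name is illustrative):

```perl
#!/usr/bin/env perl
# Sketch: streaming diff of two sorted record streams, one line per
# stream in memory at a time.
use strict;
use warnings;

sub merge_diff {
    my ( $fh_a, $fh_b ) = @_;
    my @diffs;
    my $la = <$fh_a>;
    my $lb = <$fh_b>;
    while ( defined $la or defined $lb ) {
        if ( !defined $lb or ( defined $la and $la lt $lb ) ) {
            chomp( my $line = $la );
            push @diffs, "F1,$line";    # record only in file 1
            $la = <$fh_a>;
        }
        elsif ( !defined $la or $lb lt $la ) {
            chomp( my $line = $lb );
            push @diffs, "F2,$line";    # record only in file 2
            $lb = <$fh_b>;
        }
        else {                          # identical lines: a match
            $la = <$fh_a>;
            $lb = <$fh_b>;
        }
    }
    return \@diffs;
}

# Pre-sorted demo data; in practice: sort the files and strip headers first.
my $sorted1 = qq{Karthick,10,10.245,"140,North Street,India"\n}
            . qq{Vinoth,12,2548.245,"140,North Street,India"\n}
            . qq{Vivek,40,2548.245,"140,North Street,India"\n};
my $sorted2 = qq{Karthick,10,10.245,"140,North Street,India"\n}
            . qq{Vinoth,12,2548.245,"140,North Street,USA"\n}
            . qq{Vivek,40,2548.245,"140,North Street,India"\n};

open my $fa, '<', \$sorted1 or die $!;
open my $fb, '<', \$sorted2 or die $!;
my $diffs = merge_diff( $fa, $fb );
print "$_\n" for @$diffs;
```

Note this compares whole lines, so a record whose address changed shows up as two entries (one `F1,` and one `F2,`), which matches the diff.csv format the question asks for.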
#!/usr/bin/env perl
# Compare two files record by record without holding either file fully in
# memory: partition the records by their first character, then match
# records via a length+MD5 digest, writing OnlyInA.txt, InBoth.txt and
# OnlyInB.txt.

use 5.018;    # also enables strict
use warnings;

use Data::Dumper;
use Digest::MD5;

my %POS;      # data-start offset of each file
my %chars;    # first characters seen across both files

open my $FILEA,'<',q{FileA.txt}
    or die "Can't open 'FileA.txt' for reading! $!";
open my $FILEB,'<',q{FileB.txt}
    or die "Can't open 'FileB.txt' for reading! $!";
open my $OnlyInA,'>',q{OnlyInA.txt}
    or die "Can't open 'OnlyInA.txt' for writing! $!";
open my $InBoth,'>',q{InBoth.txt}
    or die "Can't open 'InBoth.txt' for writing! $!";
open my $OnlyInB,'>',q{OnlyInB.txt}
    or die "Can't open 'OnlyInB.txt' for writing! $!";
# Skip the header line of each file and remember where the data begins.
<$FILEA>;
$POS{FILEA} = tell $FILEA;
<$FILEB>;
$POS{FILEB} = tell $FILEB;
warn Data::Dumper->Dump([\%POS],[qw(*POS)]),' ';

{ # Scan for first character of the records involved
    while (<$FILEA>) {
        $chars{substr($_,0,1)}++;
        };
    while (<$FILEB>) {
        $chars{substr($_,0,1)}--;
        };
    # So what characters do we need to deal with?
    warn Data::Dumper->Dump([\%chars],[qw(*chars)]),' ';
    };
my @chars=sort keys %chars;
{
    my %_h;
    # For each of the characters in our character set
    for my $char (@chars) {
        warn Data::Dumper->Dump([\$char],[qw(*char)]),' ';
        # Beginning of data sections
        seek $FILEA,$POS{FILEA},0;
        seek $FILEB,$POS{FILEB},0;
        %_h=();
        my $pos=tell $FILEA;
        while (<$FILEA>) {
            next
                unless (substr($_,0,1) eq $char);
            # for each record save the lengthAndMD5 as the key and its start as the value
            $_h{lengthAndMD5(\$_)}=$pos;
            $pos=tell $FILEA;
            };
        my $_s;
        while (<$FILEB>) {
            next
                unless (substr($_,0,1) eq $char);
            if (exists $_h{$_s=lengthAndMD5(\$_)}) { # It's a duplicate
                print {$InBoth} $_;
                delete $_h{$_s};
                }
            else { # (Not in FILEA) It's only in FILEB
                print {$OnlyInB} $_;
                }
            };
        # only in FILEA
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        for my $key (keys %_h) { # Only in FILEA
            seek $FILEA,delete $_h{$key},0;
            print {$OnlyInA} scalar <$FILEA>;
            };
        # Should be empty
        warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        };
    };

close $OnlyInB
    or die "Could NOT close 'OnlyInB.txt' after writing! $!";
close $InBoth
    or die "Could NOT close 'InBoth.txt' after writing! $!";
close $OnlyInA
    or die "Could NOT close 'OnlyInA.txt' after writing! $!";
close $FILEB
    or die "Could NOT close 'FileB.txt' after reading! $!";
close $FILEA
    or die "Could NOT close 'FileA.txt' after reading! $!";
exit;

sub lengthAndMD5 {
    # Key each record by its length plus its MD5 digest; collisions are
    # vanishingly unlikely.
    return sprintf( '%8.8lx-%32.32s',
        length( ${ $_[0] } ), Digest::MD5::md5_hex( ${ $_[0] } ) );
}

__END__
user3343917