I have 2 csv files of multiple fields(approx 30 fields), and huge size ( approx 4GB ).
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
I want to compare these 2 files and report the differences into another csv file. In the above example, Employee Vivek and Karthick details are present in different row numbers but still the record data is same, so it should be considered as match. Employee Vinoth record should be considered as a mismatch since there is a mismatch in the address.
Output diff.csv file can contain the mismatched record from the File1 and File 2 as below.
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
I've written the code so far as below. After this I'm confused which option to choose whether a Binary Search or any other efficient way to do this. Could you please help me?
My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line = undef;
my $f2line = undef;
my $header1 = undef;
my $f1empty = 'false';
my $f2empty = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
while (1) {
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
if ( !defined $f1line ) {
$f1empty = 'true';
}
if ( $f1empty eq 'true' ) {
last;
}
else {
## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
}
}
print "End of Program" . "\n";