Comparing two CSV files and producing a third

Question

So lately I am in work every month we have a validations check with patients. Takes days to compare last months validations with the current months:

SeptemberVal.CSV

    Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
    Male    738495      CIRC        Dr Yates    05.12.13  ... ... ...
    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...
    Male    617284      Biopsy      Dr Yates    25.12.13  ... ... ...

OctoberVal.CSV

    Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
    Male    491854      Biopsy      Dr Yates    05.12.13  ... ... ...
    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...
    Female  285769      Biopsy      Dr Yates    25.12.13  ... ... ...
    ...     ...         ...         ...         ...       ... ... ...

Output.csv

    Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...
    ...     ...       ...           ...         ...       ... ... ...

I want to create a perl script which compares the "MRN"s column of both SeptermberVal.csv and "OctoberVal.csv" and then once it finds a match I want it to copy the entire row of the match from SeptemberVal.CSV and paste it into a new file.

There could be 800 patients per validation form and many can follow on from the previous month so say next month I have 900 patients to validate 400 might be on the previous form and the rest are new.

Is this possible with Perl, if so how would I go about it? I would be grateful if anyone had any example code of how to do this. I want to pick up Perl in the long run as it is used widely in working communities.

There exists a [DBD driver for CSV](http://search.cpan.org/~hmbrand/DBD-CSV-0.41/lib/DBD/CSV.pm) files, which supports also SQL joins. — ceving, Nov 27 '13 at 20:15

score 0 · Answer 1 · edited May 23 '17 at 10:25

0

There's a perl example at *nix: perform set union/intersection/difference of lists . You would have to adapt it so that it only looks at the MRN column for testing.

edited May 23 '17 at 10:25

Community

1
1

answered Nov 27 '13 at 19:50

jez

14,867
5
37
64

score 0 · Answer 2 · answered Nov 27 '13 at 21:00

0

you should try the unix command join

join lets you:

select a field separator (comma);
select the field used for the join (2);
format the output (rows from SeptemberVal.CSV)

answered Nov 27 '13 at 21:00

Pierre

1,204
8
15

Mark Setchell · Answer 3 · 2013-11-28T12:40:16.820

Here you go - this should do it pretty nicely for you and it is easy to read and modify too.

#!/usr/bin/perl
################################################################################
# File: ProcessMRNs
# Author: Mark Setchell
# stackoverflow.com/questions/20251625/perl-comparing-two-csv-files-and-producing-a-third
################################################################################
use strict;
use warnings;
use Data::Dumper;

    my $Debug=0;    # Set to 1 for debug output

    # Check user has supplied last month and this month's CSV file
    if($#ARGV !=1){
       print "Usage: $0 <last_monthCSV> <this_monthCSV>\n";
       exit 1;
    }

    # Pick up CSV filenames from parameters
    my $lastmonth=$ARGV[0];
    my $thismonth=$ARGV[1];

    # Hash to keep last month's records in, indexed by MRN
    my %prevMRNs;
    my $header;

    # Open last month's file and read into hash indexed by MRN
    my $MRN;
    open(FH,"<",$lastmonth) or die "Unable to open $lastmonth";
    while(<FH>){
       chomp;               # Remove end of line junk
       (undef,$MRN,undef)=split(" ");   # Extract MRN from line
       # Save table header if this is it
       if($MRN =~ /MRN/){
          $header=$_;
          next;
       }
       print "DEBUG: Read last month MRN:$MRN\n" if $Debug;
       # Save this MRN into our hash of records, indexed by MRNs
       $prevMRNs{$MRN}=$_;
    }
    close FH;

    # Show user what we got from last month's CSV
    print Dumper \%prevMRNs if $Debug;

    # Now open this month's file 
    open(FH,"<",$thismonth) or die "Unable to open $thismonth";
    print "$header\n";
    while(<FH>){
       chomp;               # Remove end of line junk
       (undef,$MRN,undef)=split(" ");   # Extract MRN
       next if $MRN =~ /MRN/;       # Ignore header line
       print "DEBUG: Read current month MRN:$MRN\n" if $Debug;
       # THIS IS THE CRITICAL LINE IN THE WHOLE SCRIPT
       # If we saw this MRN last month, print what we saw
       print "$prevMRNs{$MRN}\n" if defined $prevMRNs{$MRN};
    }
    close FH;

Here is the output without debug:

    Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...

Here is the output with debug:

DEBUG: Read last month MRN:738495
DEBUG: Read last month MRN:247586
DEBUG: Read last month MRN:617284
$VAR1 = {
          '247586' => '    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...',
          '617284' => '    Male    617284      Biopsy      Dr Yates    25.12.13  ... ... ...',
          '738495' => '    Male    738495      CIRC        Dr Yates    05.12.13  ... ... ...'
        };
    Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
DEBUG: Read current month MRN:491854
DEBUG: Read current month MRN:247586
    Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...
DEBUG: Read current month MRN:285769

Assuming you save it as "ProcessMRNs", you run it by doing this:

chmod +x ProcessMRNs
./ProcessMRNs september.csv october.csv

and if you want the output into a file instead of the screen, add "> output.txt" to the end like this:

./ProcessMRNs september.csv october.csv > output.txt

Did this work for you? If so, could you kindly accept my answer with a lovely big green tick? If not, please say what went wrong so I/someone else can assist you further. — Mark Setchell, Dec 04 '13 at 10:30

Mark Setchell · Answer 4 · 2013-11-28T13:45:57.950

Just for fun, here is another (simpler) answer:

awk 'FNR==NR{a[$2]=$0;next}{if ($2 in a)print a[$2]}' september.csv october.csv

with results:

Gender    MRN     Operation     Consultant  TCI Date  ... ... ...
Female  247586    Cystoscopy    Dr Know     10.12.13  ... ... ...

This works exactly the same as the Perl solution, but uses awk's associative arrays (like Perl's hashes) and also a trick to process 2 input files, namely september.csv and october.csv.

The "FNR==NR" part (up to the "next") applies to processing the first file, and for each record it finds in that file, it saves the entire record ($0) in an associative array ("a") indexed by the MRN (field 2, or $2).

Then (starting at the "if") it processes the second file (october.csv) and says "if this MRN (field 2 or $2) is in array "a" (from the first pass through september.csv) then print whatever line we found with this MRN at that point.

score 0 · Answer 5 · answered Dec 25 '14 at 21:16

How's your Perl?

First off, you should use something like Text::ParseWords or Text::CSV for reading in your files. Both of these handle columnized files and handle quotation marks. Text::CSV is the most popular, but Text::ParseWords comes with Perl, so it's always available.

Is that MRN number unique to each file? If so, you may want to use that as a key to your data structure. You're going to have to know how to use references in Perl, so if you don't know about Perl references, read the tutorial.

Think of each line of your file as being keyed by that MRN number, and each line as a reference to another hash, where each column is keyed by that column's name:

$october{738495}->{gender}     = "M";
$october{738495}->{operation}  = "CIRC";
$october{738495}->{consultant} = "Dr Yates";
$october{738495}->{tci_date}   = "05.12.13";

Now, you can go through this structure for September, and pull up if you have the same MRI for October:

for my $mri ( sort keys %september ) {
    if ( exists $october{$mri} ) {       # Similar MRI found in September and October
        if ( $september{$mri}->{gender} eq $october{$mri}->{gender}
            and $september{$mri}->{consultant} eq  $september{$mri}->{consultant} 
            ... ) {
            ....
        else {
            ....
        }
    }
}

If you know Object Oriented Perl, you should use that and help normalize things like Gender and Consultant names, Dates, etc.

Comparing two CSV files and producing a third

5 Answers5