Parse comma delimited file based on a start and end date

Question

I am new to perl so please accept my apologies if my question is trivial. I have a very large file with data that looks like the following:

Date, Time, Data1, Data2, Data3  
1/4/1999,9:31:00 AM,blah, blah, blah  
1/4/1999,9:32:00 AM,blah, blah, blah  
1/4/1999,9:33:00 AM,blah, blah, blah

I have a file named 'cities.txt' which has a list of cities located on different rows with a comma at the end of the row.

i.e.

Boston,  
Atlanta,  
Seattle,

Each city has its own file in that same directory that has the following naming convention 'Boston 1 Minute Moisture Data.txt'. I want to first read the 'cities.txt' file and for each city that appears in that file find the associated moisture data file and extract all the data (rows) between and including TWO sets of dates (a START and an END date) and SAVE that to another file. The date is located in the first column.

I have read through comments made in the following post but I am still very confused.

How do I efficiently parse a CSV file in Perl?

I wrote a simple script using some examples online. Firstly, I just wanted to see if I was using the module correctly. So all I wanted to do was to get the parser to parse the fields and calculate the sum of a specific column.

#!/usr/bin/perl  
use strict;  
use warnings;  

use Text::CSV_XS;  
my $csv = Text::CSV_XS->new();  

my $file = 'Boston 1 Minute Moisture Data.csv';  

my $sum = 0;  
open(my $data, '<', $file) or die "Could not open '$file'\n";  
while (my $line = <$data>) {  
    chomp $line;  

    if ($csv->parse($line)) {  
        my @columns = $csv->fields();  

        $sum += $columns[3];  
    } else {  
        warn "Line could not be parsed: $line\n";  
    }  
}  
print "$sum\n";

The result is get is "Line could not be parsed: $line\n". For some reason the parser isn't parsing the fields. Any ideas?

I also tried the following code:

#!/usr/bin/perl  
use strict;  
use warnings;  
use Text::CSV;  

my $file = 'Boston 1 Minute Moisture Data.csv';  
my $csv = Text::CSV->new();  

open (CSV, "<", $file) or die $!;  

while (<CSV>) {  
    if ($csv->parse($_)) {  
        my @columns = $csv->fields();  
        #print "@columns\n";  
        print fields[1];  
        } else {  
        my $err = $csv->error_input;  
        print "Failed to parse line: $err";  
    }  
}  
close CSV;

I get the following result for every line in the file:

print() on unopened filehandle fields at test2.pl line 16, line 9326.

From the [SO Perl FAQ](http://stackoverflow.com/questions/tagged/perl?sort=faq): [What's the easiest way to install a missing Perl module?](http://stackoverflow.com/questions/65865/whats-the-easiest-way-to-install-a-missing-perl-module) — daxim, Mar 20 '11 at 23:10
Just read the first sentence of the first response on the linked post. — jiggy, Mar 20 '11 at 23:44
Your example data doesn't show which of the 'blahs' has the city name in it, which makes it hard to work out how to help you. — Jonathan Leffler, Mar 21 '11 at 03:05
There are 2 files. The first file has names of cities. The second file has moisture data for each city captured every 1 minute. So what I've been doing is trying to see if I can first parse the 2nd file (i.e. the file with the moisture data). I want to figure out how to delete records that are NOT within a user specified date range judging by the first column which is the date. Alternatively I want to figure out how to extract and save the records that DO fall within the date range and save those records to another csv file. I am good with whatever method is the cleanest i.e. delete or extract — codingknob, Mar 21 '11 at 03:11
Change "`warn "Line could not be parsed: $line\n"`" to "`die 'Line could not be parsed: ' . $csv->error_diag() . "\n"`" and let us know the results (this will tell you *why* the line couldn't be parsed, it's then much easier to decide what to fix). — Max Lybbert, Mar 21 '11 at 17:12
I will check into that later today. Thank you folks - I greatly appreciate it. — codingknob, Mar 22 '11 at 15:50

Jonathan Leffler · Accepted Answer · 2011-03-21T03:52:18.550

The overall structure of a solution to your problem appears to be:

open cities.txt
for each line read from cities.txt
- open the "$city 1 Minute Moisture Data.txt" file
- for each line from the moisture file
  - if the line's date falls within range
  - add the line to the save file

You have not specified whether there is a separate save file per city.

Your trial solutions are correctly using the Text::CSV module - that is good. You also need some way of parsing the date values - both the input values (the start and end dates) and the scanned values (from the moisture data). I'd probably use the POSIX::strptime module, but you could use any of a myriad other date and time manipulation modules.

It isn't great Perl - but the code below seems to work when run as:

$ perl scan.pl 1/3/1999 30/4/1999
Boston,1/4/1999,9:31:00 AM,blah, blah, blah  
Boston,1/4/1999,9:32:00 AM,blah, blah, blah  
Boston,1/4/1999,9:33:00 AM,blah, blah, blah
Atlanta,1/4/1999,9:31:00 AM,blah, blah, blah  
Atlanta,1/4/1999,9:32:00 AM,blah, blah, blah  
Atlanta,1/4/1999,9:33:00 AM,blah, blah, blah
Seattle,1/4/1999,9:31:00 AM,blah, blah, blah  
Seattle,1/4/1999,9:32:00 AM,blah, blah, blah  
Seattle,1/4/1999,9:33:00 AM,blah, blah, blah
$ perl scan.pl 1/3/2000 30/4/2000
$

(Given the cities data from the question, and a copy of the example data for each city. I'm assuming normal (eg UK) style dates with the sequence day, month, year. If you are working with American style dates, you have adjustments to make. If you play with invalid dates, you will get errors; the error handling in get_date() is non-existent.)

#!/usr/bin/env perl
use strict;
use warnings;
use POSIX::strptime;
use Text::CSV;

my $cities = "cities.txt";

die "Usage: $0 start-date end-date\n" if scalar(@ARGV) != 2;

my $start = get_date($ARGV[0]);
my $end   = get_date($ARGV[1]);

{
    open my $cfh, "<", $cities or die "Failed to open $cities ($!)";
    while (<$cfh>)
    {
        chomp;
        my $city = $_;
        $city =~ s/\s*,.*//;
        $city =~ s/^\s*//;
        my $moisture = "$city 1 Minute Moisture Data.txt";
        open my $mfh, "<", $moisture or die "Failed to open $moisture ($!)";
        process_file($mfh, $moisture, $city);
    }
}

sub get_date
{
    my($str) = @_;
    my ($mday, $mon, $year) = ( POSIX::strptime($str, '%d/%m/%Y') )[3,4,5];
    return (($year + 1900) * 100 + ($mon + 1)) * 100 + $mday;
}

sub process_file
{
    my($fh, $file, $city) = @_;
    my $csv = Text::CSV->new() or die "Failed to create Text::CSV object";
    my $line = <$fh>;
    die "Unexpected EOF in $file" unless defined $line;
    while ($line = <$fh>)
    {
        chomp $line;
        die "Failed to parse line <<$line>>" unless $csv->parse($line);
        my @columns = $csv->fields();
        die "Insufficient columns in <<$line>>" if scalar(@columns) < 1;
        my $date = get_date($columns[0]);
        print "$city,$line\n" if ($date >= $start && $date <= $end);
    }
}

Thanks Jonathan for the reply - I greatly appreciate it. Yes the logic you have laid out above is what I am thinking. However, in my attempt at using Text::CSV and Text:CSV_XS I run into problems. In the case of CSV I get filehandle open errors and in the case of CSV_XS I get could not parse problems. Any idea why? — codingknob, Mar 21 '11 at 03:33
@user668624L Amongst other problems, the Text::CSV code code references `fields[1]` instead of `$columns[0]`, and attempts to open `Boston 1 Minute Moisture Data.csv` instead of `Boston 1 Minute Moisture Data.txt`; fix those and print newlines and it works sanely. — Jonathan Leffler, Mar 21 '11 at 04:17
@user668624: The Text::CSV_XS code also attempts to open the wrong file name (according to the question) - the `.csv` suffix is used where `.txt` is specified. — Jonathan Leffler, Mar 21 '11 at 04:19

Parse comma delimited file based on a start and end date

1 Answers1