-1

I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line has comments that users have put in, and sometimes they use commas within their comments. If there are commas input, that field will be contained in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -

1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30

NOTE - I do not have the Text::CSV module available to me, nor will it be made available in the server I am using.

Here is part of my code in parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in @fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -

my @data;

# Read the lines one by one.

while ( $line = <$FH> ) {

    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file

    chomp($line);
    my @fields = split(/,/, $line);

    unshift @fields, join '_', @fields[0..2];

    # remove user input commas in fields[7,19]

    $_ =                 for fields[7,19]; 

    # format DATE and DATETIME fields for MySQL/sqlbatch60

    $_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
    $_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];

    # write the parsed record back to the file

    push @data, \@fields;
}
BigRedEO
  • 807
  • 4
  • 13
  • 33
  • 3
    How do you correctly parse a CSV file? You use a well-designed, well-tested module like [Text::CSV](https://metacpan.org/pod/Text::CSV). – Matt Jacob Apr 12 '16 at 18:58
  • 1
    The "_up to_" 30 is a problem -- what if it's 10 chars, with some commas, and then a legitimate comma (field delimiter)? Is there anything that you know with certainty about any specific fields after field 7? (For example, "ah, field 10 must always begin with ...") – zdim Apr 12 '16 at 19:52
  • @zdim - Now I'm being told there WILL be double quotes around that field whenever a comma was input by the user. Can **tr/,/ /;** be used? – BigRedEO Apr 12 '16 at 20:19
  • Please clarify. You are talking about the *eighth* field (index 7) that contains `MADE FOR DRAWDOWNS, NEVER P/U`. Is that right? Do you know how many fields there should be in total? Can only the eighth field be entered manually like this? – Borodin Apr 12 '16 at 23:20
  • I hope you realise that this is malformed input data and is unacceptable? If your changes are just a temporary solution then that's fine, but the fixes must be done on the *input* software: that is the single source of the problem. Left uncorrected there may be endless programs like yours that will have to correct the same issue. Please don't let that happen. – Borodin Apr 12 '16 at 23:33
  • 1
    @BigRedEO Is the field quoted or not? One place in your question you say there are quotes, another place you say there aren't quotes, and your comment says there are quotes. Please edit your question to be consistent everywhere; it's very confusing right now. You don't need to add new blocks that say "Update" or "Edit", since we can all see the edit history by clicking the link beneath the post; instead, replace information that's no longer relevant, and make sure you don't contradict yourself in different parts of the post. – ThisSuitIsBlackNot Apr 13 '16 at 01:28
  • 1
    I have nothing to do with how the data comes to me, "malformed" or not, unfortunately. After being told different things yesterday, last I was told was that whenever there is a comma in field[7] (zero-based index), there will be double quotes around that field in the record/file. It is also the only field in each record that is manually input by a user every time. – BigRedEO Apr 13 '16 at 12:29
  • Even if they say "yes quotes," given all that flip-flopping and the existence of user's input, I'd suggest to keep it safe, too, if possible -- is the number of fields in the file supposed to be fixed? (Even if it differs from file to file.) If not, are there some immutable phrases in the fields following this one, certain to be there? – zdim Apr 13 '16 at 18:55
  • @zdim - the number of fields will always be fixed - always be the same. That one field is the only one with user input and if there is a comma in it, there will be double quotes around that field. I think you had posted something yesterday with deleting any comments between double quotes that might work - just have to learn how to use it with a scalar variable as with the other fields I manipulated in the code above (the joins and the time formatting). – BigRedEO Apr 13 '16 at 19:51
  • @MattJacob - I tried Text::CSV - got this error - `Can't locate Text/CSV.pm in @INC (@INC contains: /usr/perl5/site_perl/5.12/sun4-solaris-64int /usr/perl5/site_perl/5.12 /usr/perl5/vendor_perl/5.12/sun4-solaris-64int /usr/perl5/vendor_perl/5.12 /usr/perl5/5.12/lib/sun4-solaris-64int /usr/perl5/5.12/lib .) at ReadFile.pl line 5.` I'm assuming this means we don't even have Text::CSV loaded. Since I'm a total noob to Perl, is it easy to load a module? (Don't know if I have permissions to do so on the server I'm using) – BigRedEO Apr 14 '16 at 15:01
  • @BigRedEO You probably don't have the module installed. Installing a module is [fairly easy](http://stackoverflow.com/q/65865/82262) – Matt Jacob Apr 14 '16 at 16:56
  • I have asked several questions about the files being given to me and got clarification yesterday. I have edited my original question to correctly reflect what is happening. I tried many different things yesterday, including tr, sed and awk, all without any success. I have yet to find an answer that works in just searching with Google. – BigRedEO Apr 15 '16 at 16:25

3 Answers3

1

If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way

Suppose the total number of fields is always N

  • Split the line on commas ,
  • Separate and store the first six fields
  • Separate and store the last n fields, where n is N-8
  • Rejoin what remains with commas ,. This now forms field 8

and then do what ever you like to do with it. For example, write it to a proper CSV file

Borodin
  • 126,100
  • 9
  • 70
  • 144
vanHoesel
  • 956
  • 9
  • 22
  • I have edited your solution as I think your idea is excellent and it needed to be explained a little better. Please roll back my edit if you don't agree with my changes – Borodin Apr 12 '16 at 23:30
0

Text::CSV_XS handles quoted commas just fine:

#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

my $aoa = csv(in => 'file.csv');  # The file contains the sample line.
print $aoa->[0][6];
choroba
  • 231,213
  • 25
  • 204
  • 289
0

Note   The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.

All code has been tested with the supplied example and its variations.


Following clarifications, this deals with the case when the file need be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there we have a malformed file.

I take it that the number of fields in the file is known with certainty and ahead of time.

The two independent solutions below use either array or string processing.

(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.

use strict;
use warnings;

my $num_fields = 39;  # what should be, using the example
my $ibad = 6;         # index of the malformed field-to-be
my @last = (-($num_fields-$ibad-1)..-1);  # index-range, rest of fields

my $file = "file.csv";
open my $fh, '<', $file;

while (my $line = <$fh>) {   # chomp it if needed
   my @fields = split ',', $line;
   if (@fields != $num_fields) {
       # join extra elements by space
       my $fixed = join ' ', @fields[$ibad..$ibad+@fields-$num_fields];
       # overwrite array by good fields
       @fields = (@fields[0..$ibad-1], $fixed, @fields[@last]);
   }   
   # Process @fields normally
   print "@fields";
}   
close $fh;

(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.

while (my $line = <$fh>) {
    # Number of fields: commas + 1  (tr|,|| counts number of ",")
    my $have_fields = $line =~ tr|,|| + 1;
    if ($have_fields != $num_fields) {   
        # Get indices of commas delimiting the bad field
        my ($beg, $end) = map {
            my $p = '[^,]*,' x $_; 
            $line =~ /^$p/ and $+[0]-1; 
        } ($ibad, $ibad+$have_fields-$num_fields);

        # Replace extra commas and overwrite that part of the string
        my $bad_field = substr($line, $beg+1, $end-$beg-1);
        (my $fixed = $bad_field) =~ tr/,/ /;
        substr($line, $beg+1, $end-$beg-1) = $fixed;
    }   
    # Perhaps write the line out, for a corrected .csv file
    print $line;
}   

In the last line the bad part of $line is overwritten by assigning to substr, what this function allows. The new substring $fixed is constructed with commas changed (or removed, if desired), and used to overwrite the bad part of the $line. See docs.


If quotes are known to be there a regex can be used. This works with any number of bad fields.

while (my $line = <$fh>) {
    $line =~ s/."([^"]+)"/join ' ', split(',', $1)/eg;  # "
    # process the line. note that double quotes are removed
}

If the quotes are to be kept move them inside parenthesis, to be captured as well.

This one line is all that need be done after while (...) { to clean up data.

The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.

All code has been tested with multiple lines and multiple commas in the bad field.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • Only just got back from four days away. So, would the regex line be used in place of the substr line? – BigRedEO Apr 20 '16 at 14:09
  • @BigRedEO It is the only thing you need once the `while` loop starts, that one line. I've added that to the answer. so that it is complete. Note that I also added the `/g` modifier so that this will fix all bad fields (with double quotes) on a line. Let me know. – zdim Apr 21 '16 at 06:29
  • @BigRedEO I don't know whether you noticed that I updated my answer to clarify your question. There are now three ways to solve it under different circumstances, all tested. I am curious, what happened with this in the end? – zdim May 07 '16 at 08:57