I'm parsing a CSV file with embedded commas, and obviously, using split() has a few limitations due to this.

One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...

for example:

(Date, Notional), "Date, Notional", "(Date, Notional)"

Also, I'm trying to do this without using any modules for certain reasons I don't want to go into right now...

Can anyone help me out with this?

andrejr

3 Answers

This should do what you need. It works in a very similar way to the code in Text::CSV_PP, but doesn't allow for escaped characters within a field, since you say you have none.

use strict;
use warnings;
use 5.010;

my $re = qr/(?| "\( ( [^()""]* ) \)" |  \( ( [^()]* ) \) |  " ( [^"]* ) " |  ( [^,]* ) ) , \s* /x;

my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';

my @fields = "$line," =~ /$re/g;

say "<$_>" for @fields;

output

<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>
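
For readers who find the one-line pattern dense, here is the same regex laid out with comments and wrapped in a sub. The name split_fields and the extra bare field in the sample are my own additions for illustration; treat it as a sketch rather than a replacement for the code above.

```perl
use strict;
use warnings;
use 5.010;    # the branch reset group needs Perl 5.10 or later

# Hypothetical wrapper name; the pattern is the one above, laid out with comments.
sub split_fields {
    my ($line) = @_;
    my $re = qr/
        (?|                        # branch reset: every alternative captures into group 1
            "\( ( [^()"]* ) \)"    # a field wrapped in both quotes and parentheses
          | \(  ( [^()]*  ) \)     # a field wrapped in parentheses only
          | "   ( [^"]*   ) "      # a field wrapped in double quotes only
          |     ( [^,]*   )        # a bare field without embedded commas
        )
        , \s*                      # the separating comma and any following whitespace
    /x;
    # A trailing comma is appended so the last field is terminated like the others.
    return "$line," =~ /$re/g;
}

say "<$_>" for split_fields('(a, b), "c, d", "(e, f)", g');
```

The order of the alternatives matters: the most specific wrapper is tried first, and the bare-field branch only applies when none of the wrapped forms match.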

Update

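Here's a version for older Perls (prior to 5.10) that don't have the regex branch reset construct. It produces identical output to the above. Note that it must not use `use 5.010` or `say`, which are themselves unavailable on those versions.

use strict;
use warnings;

# Without branch reset, each alternative has its own capture group,
# so every match returns four values, three of them undef.
my $re = qr/(?: "\( ( [^()""]* ) \)" |  \( ( [^()]* ) \) |  " ( [^"]* ) " |  ( [^,]* ) ) , \s* /x;

my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';

# Discard the undefined captures from the alternatives that did not match.
my @fields = grep defined, "$line," =~ /$re/g;

print "<$_>\n" for @fields;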
Borodin

I know you already have a working solution with Borodin's answer, but for the record there is also a simple solution with split (see the results at the bottom of the online demo). This situation sounds very similar to the question "regex match a pattern unless...".

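#!/usr/bin/perl
use strict;
use warnings;

# Match a complete (...) or "..." group and discard it with (*SKIP)(*F);
# otherwise match a comma with optional surrounding whitespace.
my $regex   = qr/(?:\([^)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*/;
my $subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
my @splits  = split $regex, $subject;
print "\n*** Splits ***\n";
print "$_\n" for @splits;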

How it Works

The left side of the alternation | matches complete (parentheses) and (quotes), then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.
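
One consequence of this approach is that the parentheses and quotes stay in the returned fields, because only the separating commas are consumed. If you want the bare values, a minimal sketch could unwrap each field after splitting. The helper names strip_wrappers and split_and_unwrap are hypothetical, and the unwrapping assumes each field has at most one pair of quotes around at most one pair of parentheses, as in the question.

```perl
use strict;
use warnings;
use 5.010;    # (*SKIP) and (*F) need Perl 5.10 or later

# Hypothetical helper: remove one pair of surrounding double quotes,
# then one pair of surrounding parentheses, if present.
sub strip_wrappers {
    my ($field) = @_;
    $field =~ s/^"(.*)"$/$1/;
    $field =~ s/^\((.*)\)$/$1/;
    return $field;
}

# Hypothetical helper: split on separating commas only, then unwrap each field.
sub split_and_unwrap {
    my ($text) = @_;
    my @raw = split /(?:\([^)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*/, $text;
    return map { strip_wrappers($_) } @raw;
}

say "<$_>" for split_and_unwrap('(Date, Notional), "Date, Notional", "(Date, Notional)"');
```

With the sample line from the question, each of the three fields should come back as Date, Notional without its wrappers.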

Possible Refinements

If desired, the parentheses-matching portion could be made recursive to match (nested(parens)).
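
A rough sketch of that refinement follows, assuming Perl 5.10 or later for the recursion syntax. The helper name split_outside_nested and the sample string are made up for illustration. One wrinkle: the recursive group has to be a capturing group, and split returns an extra (undefined) value per separator for every capturing group in the pattern, so the sketch filters those out.

```perl
use strict;
use warnings;
use 5.010;    # recursion, possessive quantifiers and (*SKIP)(*F) need Perl 5.10 or later

# Hypothetical helper name; a sketch of the recursive refinement.
sub split_outside_nested {
    my ($text) = @_;
    my $re = qr/
        ( \( (?: [^()]++ | (?1) )* \) )   # group 1: balanced parentheses, recursing on itself
        (*SKIP)(*F)                       # discard the group and resume after it
      | "[^"]*" (*SKIP)(*F)               # discard a double-quoted run the same way
      | \s* , \s*                         # a comma outside any group is a separator
    /x;
    # split yields undef for group 1 at every separator, so filter those out.
    return grep { defined } split $re, $text;
}

say "<$_>" for split_outside_nested('f(a, g(b, c)), "x, y", z');
```

Unbalanced parentheses are not handled specially here: if a group never closes, the commas inside it are treated as ordinary separators.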

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

zx81

I know this is quite an old question, but for completeness I would like to add the solution from the great book "Mastering Regular Expressions" by Jeffrey Friedl (page 271):

sub parse_csv {
    my $text = shift; # record containing comma-separated values
    my @fields = ( );
    my $field;
 
    chomp($text);

    while ($text =~ m{\G(?:^|,)(?:"((?>[^"]*)(?:""[^"]*)*)"|([^",]*))}gx) {
        if (defined $2) {
            $field = $2;
        } else {
            $field = $1;
            $field =~ s/""/"/g;
        }
#        print "[$field]";
        push @fields, $field;
    }
    return @fields;
}

Try it against a test row:

    my $line = q(Ten Thousand,10000, 2710 ,,"10,000",,"It's ""10 Grand"", baby",10K);
    my @fields = parse_csv($line);
    my $i;

    for ($i = 0; $i < @fields; $i++) {
         print "$fields[$i],";
    }
    print "\n";