9

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:

1,"abcd",2,"de,fg",3,"hijk"

I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.

Kespan
  • 149
  • 1
  • 1
  • 7
  • out of curiosity, why match commas? what are you trying to achieve? – Fredrik Pihl Jun 21 '11 at 21:55
  • 1
    I'd suggest avoiding regex altogether for this problem - and go for the approach of splitting the string on _all_ commas, and then make a pass through the split-out components and merge those that start/end with quotes (removing the quotes in the process). Whilst this is _slightly_ oversimplified in one sentence, it's still a simple process. – Will A Jun 21 '11 at 21:57
  • 2
    @Will A - that approach will fall down when the quoted string contains a newline. And it doesn't deal with quoted fields that contain quotes. Parsing CSV is a job for a parser. – Grant McLean Jun 22 '11 at 01:08
  • @Grant - newlines within strings. Sick - didn't realise that CSV supported that! I'm sure that my approach could be tailored to cope with this - but I guess the more complex it gets the less attractive it is! – Will A Jun 22 '11 at 17:26

6 Answers6

9

See my post that solves this problem for more detail.

^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ Will match the whole line, then you can use match.Groups[1 ].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.

Community
  • 1
  • 1
agent-j
  • 27,335
  • 5
  • 52
  • 79
5

CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.

What language are you using?

Andy Lester
  • 91,102
  • 13
  • 100
  • 152
  • Frederick/Andy: I'm trying to pass lines from a CSV through sed (ideally) before further processing. I need to replace comma delimiters with another character before another script can process this. – Kespan Jun 21 '11 at 22:01
5

As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data.

  • I suggest the tool CSVFIX as likely to do what you need.

To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):

"""",,"",a,"a,b"

Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:

"",,"",a",b c",

The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!

This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:

use strict;
use warnings;

my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );

foreach my $string (@list)
{
    print "Pattern: <<$string>>\n";
    while ($string =~ m/ (?: " ( (?:""|[^"])* ) "  |  ( [^,"] [^,]* )  |  ( .? ) )
                         (?: $ | , ) /gx)
    {
        print "Found QF: <<$1>>\n" if defined $1;
        print "Found PF: <<$2>>\n" if defined $2;
        print "Found EF: <<$3>>\n" if defined $3;
    }
}

Note that as written, you have to identify which of the three captures was actually used. With two stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled up double quotes. This regex assumes that if the field does not start with a double quote, then there double quote has no special meaning within the field. Have fun ringing the changes!

Output:

Pattern:  <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern:  <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>

We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct. Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
0

I know this is old, but this RegEx works for me:

/(\"[^\"]+\")|[^,]+/g

It could be use potentially with any language. I tested it in JavaScript, so the g is just a global modifier. It works even with messed up lines (extra quotes), but empty is not dealt with.

Just sharing, maybe this will help someone.

pqsk
  • 2,124
  • 3
  • 23
  • 28
0

Without thinking to hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?

Without context it's impossible to give a more specific solution.

  • No, because that wouldn't match the comma between the 'e' and 'f', even though that's not a delimiter. I should have been more clear - a comma that exists between double quotes is not considered a delimiter in this case. Also, the format I gave is not meant to imply that every line will look exactly the same, but rather that a line consists of numbers and double-quoted strings, delimited by commas. – Kespan Jun 21 '11 at 21:55
  • Sorry if I wasn't clear; it's hard to know your use case without more info. Normally, when someone parses is dealing with CSV what they *actually* want is the values, not the delimiters. A MatchAll-type function with this regex would give you a list/array of all the items in the CSV file. If, as you said in another comment, you're using `sed` to change the commas to something else, why not instead use `grep -eo` to get newline-delimited values? – Anschel Schaffer-Cohen Jun 22 '11 at 16:17
0

Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.

Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:

use Text::CSV;
use feature 'say';

my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $rows = $csv->getline_all(STDIN);
for my $row (@$rows) {
    say join("\t", @$row);
}

That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?

user98743
  • 23
  • 2