12

I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example lines 1 - 4 work correctly but 5,6 and 7 don't.

COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSV's are are more complicated than they look I'm using a recipe from the Perl Cookbook.

sub parse_csv {
  my $text = shift; #record containg CSVs
  my @columns = ();
  push(@columns ,$+) while $text =~ m{
    # The first part groups the phrase inside quotes
    "([^\"\\]*(?:\\.[^\"\\]*)*)",?
      | ([^,]+),?
      | ,
    }gx;
  push(@columns ,undef) if substr($text, -1,1) eq ',';
  return @columns ; # list of vars that was comma separated.
}

Does anyone have a suggestion for improving the regex to handle the above cases?

brian d foy
  • 129,424
  • 31
  • 207
  • 592
Mark Nold
  • 5,638
  • 7
  • 31
  • 33

7 Answers7

35

Please, Try Using CPAN

There's no reason you couldn't download a copy of Text::CSV, or any other non-XS based implementation of a CSV parser and install it in your local directory, or in a lib/ sub directory of your project so its installed along with your projects rollout.

If you can't store text files in your project, then I'm wondering how it is you are coding your project.

http://novosial.org/perl/life-with-cpan/non-root/

Should be a good guide on how to get these into a working state locally.

Not using CPAN is really a recipe for disaster.

Please consider this before trying to write your own CSV implementation.

Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.

note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.

Ether
  • 53,118
  • 13
  • 86
  • 159
Kent Fredric
  • 56,416
  • 14
  • 107
  • 150
  • @Kent, thanks... my main aversion to Text::CSV was the difficulty installing at the other.. ie: do they have a compiler (not all un*xs come with one) etc. But i rechecked (due to your first post) and there is a Pure Perl implementation. CSV_PP. Thanks. – Mark Nold Mar 11 '09 at 08:06
  • 2
    You might also want to look at [Text::xSV](http://p3rl.org/Text::xSV) , which is also Pure Perl – Kent Fredric Mar 11 '09 at 08:19
  • I'll have to check that out as Text:CSV_PP doesn't work with the fifth case, even with allow_loose_quotes and escape_char set. Thanks again. – Mark Nold Mar 12 '09 at 02:23
  • 6
    I've had to deal with junior programmers demanding every CPAN module under the sun be installed when a simple regexp would have sufficed. The Perl community advocates "there is more than one way to do it" so spray painting LARGE BOLD LETTERS as if there is ONLY ONE APPROACH does not help those who may actually want to discover an alternative method. – PP. Feb 09 '10 at 11:48
  • "when a simple regexp would have sufficed" -- without seeing the examples I can't know for sure, but my experience is generally indicative of that statement being wrong. Often when you think a Regexp would be sufficient you simply don't know the problem domain well enough to know differently. This is why modules exist, so people who are experts on the problem domain can solve it the best way, and solve the problems you don't even realise exist yet. – Kent Fredric Jun 29 '11 at 19:22
  • 2
    @PP. that's not half as bad as the more common situation, which is some pain in the ass denying the use of a module for no remotely sensible reason because they neither understand the myriad ways that "a simple regex" fails to do the job well, nor have the ambition to learn. – hobbs Mar 19 '16 at 21:43
26

You can parse CSV using Text::ParseWords which ships with Perl.

use Text::ParseWords;

while (<DATA>) {
    chomp;
    my @f = quotewords ',', 0, $_;
    say join ":" => @f;
}

__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

which parses your CSV correctly....

# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:

The only issue I've had with Text::ParseWords is when nested quotes in data aren't escaped correctly. However this is badly built CSV data and would cause problems with most CSV parsers ;-)

So you may notice that

# S,"YBELT,"V"",000323030,

came out as (ie. quotes dropped around "V")

# S:YBELT,V:000323030:

however if its escaped like so

# S,"YBELT,\"V\"",000323030,

then quotes will be retained

# S:YBELT,"V":000323030:
AndyG
  • 39,700
  • 8
  • 109
  • 143
draegtun
  • 22,441
  • 5
  • 48
  • 71
  • FYI Text::ParseWords is included in all versions of Perl 5: perl -MModule::CoreList -l -e'print Module::CoreList->first_release_by_date("Text::ParseWords");' printd 5.000 – mirod Mar 11 '09 at 10:27
  • @draegtun and @mirod thanks for the pointer. I'm testing this out now, thanks. – Mark Nold Mar 12 '09 at 03:25
  • 1
    Unfortunately, unless you have a state-maintaining parser, you can't parse *all* CSV on a line-by-line basis. Some CSV's have literal line feeds inside quoted strings, and this once fact makes parsing CSV a nightmare. ie: If you can solve the line feed issue yourself, you get this: https://gist.github.com/1329430 , but when you apply the code as you suggested on literal data; you get this monstrosity https://gist.github.com/1329436 . Subtleties like this are why you need a real parser ;) – Kent Fredric Oct 31 '11 at 23:54
  • This fails if the string contains 2 quotes in a row, eg "he said "hello"" – MikeKulls Feb 04 '13 at 04:14
  • @MikeKulls - That's invalid CSV so i'm not surprised it fails. – draegtun Oct 02 '18 at 15:43
  • @draegtun there is no CSV standard but using 2 sets of double quotes to represent a single quote is common. – MikeKulls Oct 10 '18 at 10:14
  • @MikeKulls - I know but your example is invalid because it should have been - `"he said ""hello""" ` – draegtun Oct 11 '18 at 19:04
2

tested; working:-

$_.=','; # fake an ending delimiter

while($_=~/"((?:""|[^"])*)",|([^,]*),/g) {
  $cell=defined($1) ? $1:$2; $cell=~s/""/"/g; 
  print "$cell\n";
}

# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
#  "((?:""|[^"])*)",
# It must start with a quote, and end with a quote followed by a comma, and is allowed to contain either doublequotes - "" - or anything except a sinlge quote [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.

be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if reading the data in line-at-a-time:

if("$pre$_"=~/,"[^,]*\z/) {
  $pre.=$_; next;
}
$_="$pre$_";
Community
  • 1
  • 1
Chris
  • 21
  • 2
  • Would you be able to reorganise that code into a function of sorts that can be applied to a body of text somehow? I'd love to test it to see how it plays with my sample data like I have with the others ( ie: https://gist.github.com/1329456 ). =) – Kent Fredric Nov 01 '11 at 00:14
2

This works like charm

line is assumed to be comma separated with embeded ,

my @columns = Text::ParseWords::parse_line(',', 0, $line);

1

Finding matching pairs using regexs is non-trivial and generally unsolvable task. There are plenty of examples in the Jeffrey Friedl's Mastering regular expressions book. I don't have it at hand now, but I remember that he used CSV for some examples, too.

Eugene Morozov
  • 15,081
  • 3
  • 25
  • 32
  • "unsolvable"? You can use regexes to find matching quotes easily! It's parenthesis that regexes can't do, not because they're matching, but because they're NESTED matching. You can't (generally) nest quotes. (You can have \", but that doesn't start a new string inside the old one, now does it?) – Chris Lutz Mar 11 '09 at 07:22
  • Thanks Eugene, funnily enough i'm pretty sure the Perl Cookbook example is taken from MRE :) I will double check though. – Mark Nold Mar 11 '09 at 07:28
  • 1
    /((?:[^\n,"]|"(?:[^"]|"")+")+),/g should be closer to what the OP wants, but is by my own admission still imperfect. – Chris Lutz Mar 11 '09 at 07:31
  • Yes, I believe quotes, as-is, inside a quoted area is invalid CSV, there's some escaping mechansim required otherwise working it out by guesswork is simply impossible. – Kent Fredric Mar 11 '09 at 07:35
0

Tested:


use Test::More tests => 2;

use strict;

sub splitCommaNotQuote {
    my ( $line ) = @_;

    my @fields = ();

    while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
        if ( $2 ) {
            push( @fields, $3 );
        } else {
            push( @fields, $1 );
        }
        last if ( ! $4 );
    }

    return( @fields );
}

is_deeply(
    +[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
    +['S', '"D" REL VALVE ASSY', '000771881', ''],
    "Quote in value"
);
is_deeply(
    +[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
    +['S', 'BELT V,FAN', '000324244', ''],
    "Strip quotes from entire value"
);
PP.
  • 10,764
  • 7
  • 45
  • 59
  • I know its not listed in the sample dataset, but what about linefeeds in the middle of quoted string fields? Does your code work in such a situation? Did you realise linefeeds are permissible in quoted strings in CSV? How much time will you waste trying to re-work and re-test your code to handle that edge case?Having had to implement a CSV parser before, I can spout forth numerous scenarios that will break a naïve parser, and I assure you you'll have a nasty bogload of unmaintainable code quicky, when you could have just installed and used something that existed and go work on something else. – Kent Fredric Jun 29 '11 at 19:29
  • I provided tested code. You did not. Some people find regular expressions difficult. That's okay. I quite like, and enjoy, regular expressions (probably why I use Emacs). – PP. Jun 30 '11 at 08:31
  • https://gist.github.com/1329456 # your code on a sample dataset containing a line feed in the middle of the field. Enjoy =). – Kent Fredric Nov 01 '11 at 00:04
0

You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).

Chris Lutz
  • 73,191
  • 16
  • 130
  • 183