0

Possible Duplicate:
How can I parse quoted CSV in Perl with a regex?

I am attempting to take a CSV file and import each row into an array (where each element represents a column). The format of a CSV file is very simple:

item1,item2,item3
nextrowitem1,item2,item3
"items,with,commas","are,in,quotes"

I imported the CSV file using:

open(my $fh, '<', 'test.csv') or die "Cannot open test.csv: $!";
my @lines = <$fh>;

Then I looped through it using:

foreach (@lines) {
    my @items = split(/regular expression/);
    # Do stuff with @items array
}

(Note that you do not need to use split(/regular expression/, $string); because split() assumes $_ if no string is supplied)
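
As a small illustration of the note above, the sketch below compares the implicit and explicit forms of split. The helper names split_implicit and split_explicit are made up for this example, and the plain /,/ pattern is only the naive separator, not a fix for quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on commas, relying on the $_ default.
sub split_implicit {
    local $_ = shift;
    chomp;                # chomp also defaults to $_
    return split(/,/);    # no string argument, so split operates on $_
}

# Hypothetical helper: the same split with the string passed explicitly.
sub split_explicit {
    my ($line) = @_;
    chomp $line;
    return split(/,/, $line);
}

my @implicit = split_implicit("item1,item2,item3\n");
my @explicit = split_explicit("item1,item2,item3\n");
print "same result\n" if "@implicit" eq "@explicit";
```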

Earlier, I tested this with a CSV file in which none of the items contained commas, using the simple regular expression split(/,/). This worked just fine, so there is nothing wrong with the file, reading it, or my loop after this regular expression. However, when I hit items that contained a comma, they were understandably divided like so:

1 => "items
2 => with
3 => commas"
4 => "are
5 => in
6 => quotes"

Instead of the desired:

1 => items,with,commas
2 => are,in,quotes

Can anyone help me develop a regular expression to split this array correctly? Basically if the item starts with a quote ("), it needs to wait until "," to split. If the item does not start with a quote, it needs to wait until , to split.

Community
stevendesu
  • Thank you for linking to that =) Definitely a duplicate - in fact, his question went into even more detail than mine. I will look into CPAN now to see if I can make use of it. – stevendesu Jun 22 '11 at 02:20

3 Answers

5

Try Text::CSV, which already does this. The problem with parsing CSV using a regular expression is that you have to handle separators like "," (which you indicated) as well as a plain , separator.
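
A minimal sketch of what that could look like, assuming Text::CSV is installed from CPAN. The helper name parse_csv_line is invented for this example, and the binary => 1 option is a common choice rather than something the question requires.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# binary => 1 allows arbitrary characters inside fields.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV->error_diag();

# Hypothetical helper: parse one line of CSV into a list of column values.
sub parse_csv_line {
    my ($line) = @_;
    chomp $line;
    $csv->parse($line)
        or die "Cannot parse line: " . $csv->error_diag();
    return $csv->fields();    # quotes are already stripped from each field
}

my @items = parse_csv_line(qq{"items,with,commas","are,in,quotes"\n});
print "$_\n" for @items;
```

When reading a real file, the module's getline method on an open filehandle may be a better fit than parsing line by line, since quoted fields are allowed to contain embedded newlines.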

cjm
Suroot
  • I agree. You can't do CVS splitting with a regular expression since commas and quotes can be in a CVS field. The only real way is to break up the line bit by bit in a loop. Text::CVS does the magic for you. – David W. Jun 22 '11 at 04:34
  • @David, [CVS](http://en.wikipedia.org/wiki/Concurrent_Versions_System) and [CSV](http://en.wikipedia.org/wiki/Comma-separated_values) are _very_ different. – cjm Jun 22 '11 at 05:43
  • Yes, I know. Unfortunately, I'm dyslexic and am working with a project which uses CVS. The confusion was bound to happen. I meant "CSV". – David W. Jun 22 '11 at 16:01
5

Just use Text::CSV_XS instead...

ysth
  • Or use [Text::CSV](http://search.cpan.org/perldoc?Text::CSV), which uses Text::CSV_XS for speed if it can, but also has a pure-Perl implementation in case you don't have a C compiler. – cjm Jun 22 '11 at 05:47
-1

See my post that solves this problem for more detail.

^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$

This will match the whole line; you can then use the matched captures to get your data out (without the quotes).

Community
agent-j
  • What does it mean when an expression starts with a question mark? I know that `^ab?` will match `a` or `ab`, but what's the significance of `^(?:....`? I've never seen a question mark at the beginning... – stevendesu Jun 22 '11 at 02:44
  • This is a job for a CSV module as suggested by other answers not a regex – justintime Jun 22 '11 at 03:56
  • `(?:expression)` means a non-capturing group. This prevents the regex engine from tracking parts of the string that match that part of the expression. Look at $1, $2, $3, etc. here: http://www.regular-expressions.info/perl.html – agent-j Jun 22 '11 at 11:18
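
To make the difference between the two kinds of group concrete, here is a small sketch. The key=value input and the helper names with_capturing and with_non_capturing are assumptions made up for this illustration; they do not come from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: both groups capture, so a match returns ($1, $2).
sub with_capturing {
    my ($text) = @_;
    return $text =~ /^(key|name)=(\w+)$/;
}

# Hypothetical helper: (?:...) still groups the alternation but is not
# numbered, so the value ends up in $1 and a match returns only ($1).
sub with_non_capturing {
    my ($text) = @_;
    return $text =~ /^(?:key|name)=(\w+)$/;
}

my @both  = with_capturing("key=value");
my @value = with_non_capturing("key=value");
print scalar(@both), " captures: @both\n";
print scalar(@value), " capture: @value\n";
```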