
I have inherited some code that uses regular expressions to parse CSV-formatted data. Until now it didn't need to cope with empty string fields, but the requirements have changed and empty fields are now possible.

I have changed the regular expression from this:

new Regex("((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

to this:

new Regex("((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

(i.e. I have changed each + to *)

The problem is that I now get an extra empty field at the end, e.g. "ID,Name,Description" returns four fields: "ID", "Name", "Description" and "".

Can anyone spot why?

Simon Williams

3 Answers

Try this one:

var rx = new Regex("((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

I moved the handling of "blank" fields into a separate alternative (a third "or"). The handling of "" already worked (you didn't need to modify it; it was the second (?<field>) block of your code), so what you need to handle are these four cases:

,
,Id
Id,
Id,,Name

And this one should do it:

(?<=^|,)(?<field>)(?=,|$)

An empty field must be preceded by the beginning of the row (^) or by a comma, must have length zero (nothing is captured by the (?<field>) group), and must be followed by a comma or by the end of the line ($).
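To check the four cases listed above, a small console sketch along these lines may help. The pattern is copied from this answer; the class name EmptyFieldDemo and the helper Fields are hypothetical names made up for illustration. It assumes each row is passed in separately, because ^ and $ anchor to the whole input unless RegexOptions.Multiline is set.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EmptyFieldDemo
{
    // Pattern copied from the answer above; the class and method names are hypothetical.
    static readonly Regex Rx = new Regex(
        "((?<=^|,)(?<field>)(?=,|$)|(?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

    // Returns the value of the "field" group for every match in a single row.
    public static string[] Fields(string row) =>
        Rx.Matches(row).Cast<Match>().Select(m => m.Groups["field"].Value).ToArray();

    static void Main()
    {
        // The four edge cases from the answer, plus the header row from the question.
        foreach (var row in new[] { ",", ",Id", "Id,", "Id,,Name", "ID,Name,Description" })
        {
            string[] fields = Fields(row);
            Console.WriteLine("{0} -> {1} field(s): [{2}]", row, fields.Length, string.Join("|", fields));
        }
    }
}
```

With this pattern, "ID,Name,Description" should come back as three fields, and "Id,,Name" as three fields with an empty one in the middle.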

xanatos

I would suggest using the FileHelpers library. It is easy to use, does its job, and will make your code much easier to maintain.

Paolo Tedesco
  • Does FileHelpers allow you to read CSV data with arbitrary fields? – Simon Williams Oct 20 '11 at 08:42
  • @EasyTimer: What do you mean by arbitrary? In any case, you can use the library to deserialize a CSV file into your own classes, and the library supports optional (empty) fields as well. – Paolo Tedesco Oct 20 '11 at 08:58
  • @Paolo, the format of the CSV file is not known until runtime, i.e. we don't know what fields it might contain. My understanding is that FileHelpers is geared towards knowing the structure beforehand so that a class can be created to hold the data. – Simon Williams Oct 20 '11 at 09:45
  • @EasyTimer: Ok, I had not understood the runtime requirement! In that case, FileHelpers is not what you need. – Paolo Tedesco Oct 20 '11 at 11:17

The problem with your regex is that it can now match the empty string. Note that $ works a little like a lookahead: it asserts that the current position is the end of the string, but it consumes no characters and is not part of the match.

So when you have "ID,Name,Description", your first match is

"ID," and the rest is "Name,Description"

Then the next match is

"Name," and the rest is "Description"

The next match:

"Description" and the rest is ""

So one more match is found: the empty string at the very end of the input, which is the extra empty field you are seeing.
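The sequence of matches described above can be made visible by printing where each match starts and how long it is. This is only a sketch: the pattern is the modified one from the question, while the class name TrailingMatchDemo and the helper Describe are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TrailingMatchDemo
{
    // The modified pattern from the question (every + changed to *).
    static readonly Regex Rx = new Regex(
        "((?<field>[^\",\\r\\n]*)|\"(?<field>([^\"]|\"\")*)\")(,|(?<rowbreak>\\r\\n|\\n|$))");

    // Describes each match: where it starts, how many characters it consumed, and the captured field.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            string field = m.Groups["field"].Value;
            lines.Add(string.Format("index={0} length={1} field=\"{2}\"", m.Index, m.Length, field));
        }
        return lines;
    }

    static void Main()
    {
        foreach (string line in Describe("ID,Name,Description"))
        {
            Console.WriteLine(line);
        }
    }
}
```

The last line printed should be a zero-length match starting at index 19, the end of the input: nothing is left to consume, but an empty field followed by $ still satisfies the pattern.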

Petar Ivanov