2

Given the following data, I'd like a Regex to pull out each comma-separated value. However, a double-quoted value may contain commas.

"SMITH, JOHN",1234567890,"12/20/2012,11:00",,DRSCONSULT,DR BOB - OFFICE VISIT - CONSULT,SLEEP CENTER,1234567890,,,"a, b"
"JONES, WILLIAM",1234567890,12/20/2012,12:45,,DRSCONSULT,DR BOB - OFFICE VISIT - CONSULT,SLEEP CENTER,,,,

Here's the expression that I have so far:

(?<=^|,)(?:(?:(?<=\")([^\"]*)(?=\"))|(?:(?<![\"])([^,\"]*)(?![\"])))(?=$|,)

The double-quoted values are not being matched. What am I doing wrong? (This Regex is passed into pre-existing code - I cannot rewrite the system.)

harley.333
  • Running a complex RegEx on a large CSV file will be noticeably slower than other methods of string processing. – Eric J. May 07 '14 at 20:08
  • Somebody has to come along and say it, so it might as well be me: "Why don't you just use an existing CSV parser?" – Jon B May 07 '14 at 20:09
  • @JonB: He states that pre-existing code requires that a RegEx be passed in. – Eric J. May 07 '14 at 20:11
  • @EricJ. How would the existing code know how he finds the data, CSV parser or regex? – Bit May 07 '14 at 20:14
  • Take a look here: http://stackoverflow.com/questions/9642055/csv-parsing-options-with-net – Pedro Lobito May 07 '14 at 20:18
  • @N4TKD: `This Regex is passed into pre-existing code - I cannot rewrite the system`. Not sure what else I can say, other than he was pretty clear about that constraint. – Eric J. May 07 '14 at 20:20
  • The problem with parsing a CSV file with regex alone is that a CSV parser requires state to understand how to interpret slashes, commas, and double quotes. Although this answer explains why XML cannot be parsed by regex alone, it is still relevant to your question: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Matthew May 07 '14 at 20:27
  • @Eric J If that is the case, then get everything and use a CSV parser by sending no regex at all. – Bit May 07 '14 at 20:34

3 Answers

4

How about:

(?<=^|,)(("[^"]*")|([^,]*))(?=$|,)

The first alternative is:

("[^"]*")

Match a " followed by anything that's not a " followed by a "

The second alternative is just:

([^,]*)

Match anything that isn't a ,
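
As a rough sketch of how the pattern above could be exercised from C#: the class name `CsvRegexDemo` and the helper `SplitRow` are made up for illustration and are not part of the asker's existing system, which only receives the pattern string. The sketch assumes one row per input string and strips the surrounding quotes after matching, because the first alternative captures them along with the value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern itself comes from the answer.
public static class CsvRegexDemo
{
    // Verbatim string: "" inside @"..." is a single literal double quote.
    private static readonly Regex FieldPattern =
        new Regex(@"(?<=^|,)((""[^""]*"")|([^,]*))(?=$|,)");

    // Returns the fields of a single CSV row (no embedded line breaks assumed).
    public static List<string> SplitRow(string row)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(row))
        {
            // Quoted fields are matched with their quotes, so remove them here.
            // Empty fields come back as empty strings.
            fields.Add(m.Value.Trim('"'));
        }
        return fields;
    }
}
```

Calling `CsvRegexDemo.SplitRow` on the first sample row from the question should yield 11 fields, with `SMITH, JOHN` and `a, b` kept intact and the empty columns returned as empty strings. Note that a quoted field containing escaped (doubled) quotes is not handled specially by this pattern.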

Matt Burland
0

This pattern should work:

(\w+\,\s\w+|[\d\/]*\,\d+\:\d*|[\w\d\:\s\-]+)

example:

http://regex101.com/r/rI8nS1

When using the pattern in C# you might need to escape it like:

Match match = Regex.Match(searchText, "(?m)(?x)(\\w+\\,\\s\\w+|[\\d\\/]*\\,\\d+\\:\\d*|[\\w\\d\\:\\s\\-]+)");
if (match.Success) {...}
l'L'l
0

Here's the code which I use for coping with quote-aware CSVs

//regex to translate a CSV
readonly Regex csvParser = new Regex( "(?:^|,)(\\\"(?:[^\\\"]+|\\\"\\\")*\\\"|[^,]*)", RegexOptions.Compiled);

//given a row from the csv file, loop through returning an array of column values
private IEnumerable<string> ProcessCsvRow(string row)
{
    MatchCollection results = csvParser.Matches(row);
    foreach (Match match in results)
    {
        foreach (Capture capture in match.Captures)
        {
            // TrimStart takes chars, not a string: strip the leading comma, then quotes and spaces
            yield return (capture.Value ?? string.Empty).TrimStart(',').Trim('"', ' ');
        }
    }
}
JohnLBevan