
I'm trying to split a CSV input using the following regex:

(?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

A line/row with the data ,a,b,c results in 3 matches:

  1. (an empty match)
  2. ,b
  3. ,c

I'm losing/missing the ,a and I can't figure out what needs to change.

It seems to work using the Python option: https://regex101.com/r/kW3pQ6/1

Any idea how to fix it for .NET?

This might help:

(?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

[Regular expression visualization: Debuggex Demo]

Sean

3 Answers


As others have suggested, you should be using a class whose purpose is to parse a CSV string. The TextFieldParser class is built into .NET. Unless you have additional requirements not mentioned in your question, it's probably not necessary to use an external library.

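// Requires a reference to Microsoft.VisualBasic, plus:
//   using System.IO;
//   using Microsoft.VisualBasic.FileIO;
// "s" is the string holding the CSV text.
using (MemoryStream stream = new MemoryStream())
using (StreamWriter writer = new StreamWriter(stream))
{
    // Copy the string into the stream, then rewind so the parser reads from the start
    writer.Write(s);
    writer.Flush();
    stream.Position = 0;

    using (TextFieldParser parser = new TextFieldParser(stream))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.Delimiters = new string[] { "," };
        parser.HasFieldsEnclosedInQuotes = true;

        while (!parser.EndOfData) // Loop through lines until we reach the end of the data
        {
            string[] fields = parser.ReadFields(); // The fields of the current line
        }
    }
}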

https://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser%28v=vs.110%29.aspx

Neaox
  • This is neater than using regex for this situation. +1 to you – Kevin Avignon Feb 23 '15 at 19:54
  • Could it be simpler to just read the whole file, store it in a string array and split each line on ","? – Kevin Avignon Feb 23 '15 at 19:57
  • The TextFieldParser can read from a stream, a file or a TextReader; it can't read directly from a string or string array, which is why we "load" the string into a memory stream first. Because a CSV-formatted string/file can have escaped items, some with quotes and some without, a plain split can cause issues. The TextFieldParser is a fast way to parse a CSV string while ensuring variances like that don't trip it up. – Neaox Feb 23 '15 at 19:59
  • Great answer, thanks. I'm going to go with the CsvHelper from the other answer though, as it seems more powerful. Thanks again. – Sean Feb 23 '15 at 20:08
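
To illustrate the point made in the comments above, here is a minimal sketch contrasting a plain split on "," with TextFieldParser for a line that contains a quoted comma. The class and method names (SplitVersusParser, NaiveSplit, ParseLine) are made up for this illustration and are not part of the answer. It also assumes that wrapping the string in a StringReader is acceptable in place of the MemoryStream approach, since the parser accepts a TextReader.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical helper class, for illustration only.
public static class SplitVersusParser
{
    // Splits on every comma, including commas inside quoted fields.
    public static string[] NaiveSplit(string line) => line.Split(',');

    // Parses a single CSV line with TextFieldParser, honouring quoted fields.
    public static string[] ParseLine(string line)
    {
        // StringReader is a TextReader, which TextFieldParser accepts.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

For the line "Smith, John",42 the naive split yields three pieces, because it cuts the quoted name in half, while the parser yields the two fields Smith, John and 42.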

Why not use a CSV NuGet package that takes into account the many nuances of CSV parsing, both the ones you are trying to solve now and the ones you don't yet know you need to solve? :-)

CsvHelper is a very popular open-source package:
https://www.nuget.org/packages/CsvHelper
https://github.com/JoshClose/CsvHelper

Ralph Willgoss
  • There is no need for an external library for something so simple. Why not use the built-in `TextFieldParser` class? – Neaox Feb 23 '15 at 20:02
  • CSV parsing isn't always simple; there are many nuances, and that's why the library exists. – Ralph Willgoss Feb 23 '15 at 20:03
  • The `TextFieldParser` has taken care of every "nuance" I have come across. The only thing it doesn't handle is non quote (") text qualifiers, which are not an issue in this case. – Neaox Feb 23 '15 at 20:05

Yes, I know regex is not the "right" answer, but it is what the question asked for and I like a good regex challenge.

NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)


When multiple lines/records of a CSV file/stream (conforming to RFC 4180) are passed to the regular expression below, it returns a match for each non-empty line/record. Each match contains a capture group named Value that holds the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here's the commented pattern (test it on Regexstorm.net):

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL

Here's the raw pattern without all the comments or whitespace.
(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)

Here is a visualization from Debuggex.com (capture groups named for clarity): [Debuggex.com visualization]

Examples on how to use the regex pattern can be found on my answer to a similar question here, or on C# pad here, or here.
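
As a rough sketch of how the raw pattern above might be used from C#, the snippet below reads the captures of the Value group for each matched record. The class and method names (CsvRegexSketch, ParseRecords) are assumptions made for this illustration and are not taken from the linked examples. The pattern is the raw one from this answer, split across verbatim strings with every literal " doubled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CsvRegexSketch
{
    // Raw pattern from the answer, as C# verbatim strings (each " is doubled).
    private static readonly Regex RecordPattern = new Regex(
        @"(?<=\r|\n|^)(?!\r|\n|$)" +
        @"(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,\r\n]+)|" +
        @"""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|(?<Value>))" +
        @"(?:,|(?=\r|\n|$)))+?" +
        @"(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)");

    // Returns one list of values per non-empty record in the input.
    public static List<List<string>> ParseRecords(string csvText)
    {
        var records = new List<List<string>>();
        foreach (Match record in RecordPattern.Matches(csvText))
        {
            var values = new List<string>();
            // .NET keeps every capture of the same-named groups, in order.
            foreach (Capture capture in record.Groups["Value"].Captures)
            {
                values.Add(capture.Value);
            }
            records.Add(values);
        }
        return records;
    }
}
```

With the input ,a,b,c from the question this gives a single record of four values, the first of which is empty. Note that the sketch ignores the OpenValue group and that quoted values keep any doubled quotes ("") exactly as they appear in the input, so unescaping them would be a separate step.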

Community
David Woodward