11

In C#, using the Regex class, how does one parse comma-separated values, where some values might be quoted strings themselves containing commas?

using System ;
using System.Text.RegularExpressions ;

class  Example
    {
    public static void Main ( )
        {
        string  myString  =  "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" ;
        Console.WriteLine ( "\nmyString is ...\n\t" + myString + "\n" ) ;
        Regex   regex  =  new Regex  (  "(?<=,(\"|\')).*?(?=(\"|\'),)|(^.*?(?=,))|((?<=,).*?(?=,))|((?<=,).*?$)"  )  ;
        Match   match  =  regex.Match ( myString ) ;
        int j = 0 ;
        while ( match.Success )
            {
            Console.WriteLine ( j++ + " \t" + match ) ;
            match  =  match.NextMatch() ;
            }
        }
    }

Output (in part) appears as follows:

0       cat
1       dog
2       "0 = OFF
3        1 = ON"
4       lion
5       tiger
6       'R = red
7        G = green
8        B = blue'
9       bear

However, desired output is:

0       cat
1       dog
2       0 = OFF, 1 = ON
3       lion
4       tiger
5       R = red, G = green, B = blue
6       bear
JaysonFix
  • 2,515
  • 9
  • 28
  • 28
  • 9
    Rather than starting with the solution -- regular expressions -- and trying to apply the problem to it, consider going the other way. Start with the problem and pick the tool that clearly solves it well. I personally would solve this problem by writing a lexer for your language. That code is likely to be more clear, understandable, debuggable and extensible than the modem line noise that is a regular expression. – Eric Lippert Jul 27 '09 at 17:32

9 Answers9

24

Try with this Regex:

"[^"\r\n]*"|'[^'\r\n]*'|[^,\r\n]*

    Regex regexObj = new Regex(@"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]*");
    Match matchResults = regexObj.Match(input);
    while (matchResults.Success) 
    {
        Console.WriteLine(matchResults.Value);
        matchResults = matchResults.NextMatch();
    }

Ouputs:

  • cat
  • dog
  • "0 = OFF, 1 = ON"
  • lion
  • tiger
  • 'R = red, G = green, B = blue'
  • bear

Note: This regex solution will work for your case, however I recommend you to use a specialized library like FileHelpers.

Christian C. Salvadó
  • 807,428
  • 183
  • 922
  • 838
  • For me, your latest edited answer yields the following output: 0 cat 1 2 dog 3 4 "0 = OFF, 1 = ON" 5 6 lion 7 8 tiger 9 10 'R = red, G = green, B = blue' 11 12 bear 13 – JaysonFix Jul 27 '09 at 17:32
  • 2
    @CMS: I was able to get your idea to work for me, but I had to make a small modification (replacing * by + at the end). My Regex is ... new Regex ( @"""[^""\r\n]*""|'[^'\r\n]*'|[^,\r\n]+" ) – JaysonFix Jul 27 '09 at 17:45
  • Almost forgot: Thank you, CMS, and others for your valuable input. – JaysonFix Jul 27 '09 at 17:46
  • 1
    Whoops ... spoke too soon ... my "solution" does not work when some fields are "empty" ... see, for example, the two consecutive commas following cat: "cat,,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear" – JaysonFix Jul 27 '09 at 17:49
  • 1
    CSV isn't regular - regexes will *never* be able to parse all valid CVS files. like ever. This is madness except as a dirty hack... – ShuggyCoUk Jul 27 '09 at 17:53
  • Here is what I have finally settled on ... string regexString = "" + "((?<=\")[^\"]*(?=\"))" // " ... to ... " + "|((?<=\')[^\']*(?=\'))" // ' ... to ... ' + "|(^[^,]*(?=,))" // ^ ... to ... , + "|((?<=,)[^,]*$)" // , ... to ... $ + "|(?<=,).{0}(?=,)" // ,, + "|(?<=,)([^\"\'].*?)(?=,)" // , ... ( excluding " ' ) to ... , ; Regex regex = new Regex ( regexString ) ; – JaysonFix Jul 27 '09 at 20:25
  • @JaysonFix would you mind posting that solution in your question? I've got exactly the same problem and your solution is a little difficult to read in a comment like that. – Owen Nov 01 '13 at 12:15
22

Why not heed the advice from the experts and Don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free and open source FileHelpers library.

Judah Gabriel Himango
  • 58,906
  • 38
  • 158
  • 212
  • 1
    This was my very first thought, too. – codekaizen Jul 27 '09 at 17:47
  • 2
    Another reason: the CSV spec is actually surprisingly complex, considering all the different ways of quoting and delimiting values. Not to mention the fact that there *isn't* actually a spec for CSV … (There's RfC4180, but that explicitly states that it is only an "attempt" at a common spec, and that actual implementations vary.) So, chances are, if you roll your own, you'll do it wrong. – Jörg W Mittag Jul 27 '09 at 17:59
  • I found CsvHelper to work great and it's in NuGet (http://joshclose.github.io/CsvHelper/). It's licensed MS-PL which is less scary than the LGPL licensed FileHelpers library if you have a commercial product. – Steve Hiner Oct 09 '14 at 00:50
  • After I discovered that this wasn't an ad, but a serious library, I found that taking the snippets that I needed, and use them got me the best solution: The smallest library of what I needed, and reliable code when using FileHelpers. – Brian Webb May 02 '16 at 18:12
8

it's not a regex, but I've used Microsoft.VisualBasic.FileIO.TextFieldParser to accomplish this for csv files. yes, it might feel a little strange adding a reference to Microsoft.VisualBasic in a C# app, maybe even a little dirty, but hey it works.

kenwarner
  • 28,650
  • 28
  • 130
  • 173
  • +1 Excellent solution in my case. It is installed with the framework so no extra assemblies to add to the installer and no additional bunch of source files for a simple task. – Emile Mar 30 '12 at 18:39
8

Ah, RegEx. Now you have two problems. ;)

I'd use a tokenizer/parser, since it is quite straightforward, and more importantly, much easier to read for later maintenance.

This works, for example:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green,     B = blue',bear"; 
        Console.WriteLine("\nmyString is ...\n\t" + myString + "\n");
        CsvParser parser = new CsvParser(myString);

        Int32 lineNumber = 0;
        foreach (string s in parser)
        {
            Console.WriteLine(lineNumber + ": " + s);
        }

        Console.ReadKey();
    }
}

internal enum TokenType
{
    Comma,
    Quote,
    Value
}

internal class Token
{
    public Token(TokenType type, string value)
    {
        Value = value;
        Type = type;
    }

    public String Value { get; private set; }
    public TokenType Type { get; private set; }
}

internal class StreamTokenizer : IEnumerable<Token>
{
    private TextReader _reader;

    public StreamTokenizer(TextReader reader)
    {
        _reader = reader;    
    }

    public IEnumerator<Token> GetEnumerator()
    {
        String line;
        StringBuilder value = new StringBuilder();

        while ((line = _reader.ReadLine()) != null)
        {
            foreach (Char c in line)
            {
                switch (c)
                {
                    case '\'':
                    case '"':
                        if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Quote, c.ToString());
                        break;
                    case ',':
                       if (value.Length > 0)
                        {
                            yield return new Token(TokenType.Value, value.ToString());
                            value.Length = 0;
                        }
                        yield return new Token(TokenType.Comma, c.ToString());
                        break;
                    default:
                        value.Append(c);
                        break;
                }
            }

            // Thanks, dpan
            if (value.Length > 0) 
            {
                yield return new Token(TokenType.Value, value.ToString()); 
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

internal class CsvParser : IEnumerable<String>
{
    private StreamTokenizer _tokenizer;

    public CsvParser(Stream data)
    {
        _tokenizer = new StreamTokenizer(new StreamReader(data));
    }

    public CsvParser(String data)
    {
        _tokenizer = new StreamTokenizer(new StringReader(data));
    }

    public IEnumerator<string> GetEnumerator()
    {
        Boolean inQuote = false;
        StringBuilder result = new StringBuilder();

        foreach (Token token in _tokenizer)
        {
            switch (token.Type)
            {
                case TokenType.Comma:
                    if (inQuote)
                    {
                        result.Append(token.Value);
                    }
                    else
                    {
                        yield return result.ToString();
                        result.Length = 0;
                    }
                    break;
                case TokenType.Quote:
                    // Toggle quote state
                    inQuote = !inQuote;
                    break;
                case TokenType.Value:
                    result.Append(token.Value);
                    break;
                default:
                    throw new InvalidOperationException("Unknown token type: " +    token.Type);
            }
        }

        if (result.Length > 0)
        {
            yield return result.ToString();
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
codekaizen
  • 26,990
  • 7
  • 84
  • 140
  • 1
    There's a bug in this implementation. In IEnumerator GetEnumerator() method, you need to add "if (value.Length > 0) { yield return new Token(TokenType.Value, value.ToString()); }" within the while loop, at the end. Otherwise you are missing the final token – dpan Feb 22 '12 at 10:35
  • Thanks @dpan, fixed – codekaizen Jul 07 '16 at 03:08
7

Just adding the solution I worked on this morning.

var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

foreach (Match m in regex.Matches("<-- input line -->"))
{
    var s = m.Value; 
}

As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.

This is still a work in progress, but it happily parses CSV strings like:

2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
Joshua
  • 4,099
  • 25
  • 37
  • 2
    @Joshua, This is the best regexfor splitting a string I've found. And I've tried quite a few. The magic appears to be at the `(?<=^|,)` part. Most similar solutions have `:^` rather than `<=^`. The other solutions will be incorrect for ",,1" reducing it down to a match of 2, yours correctly identifies 3. Thank you. – Jon Bellamy Jul 17 '15 at 13:57
  • 1
    This is a fantastic solution. – TimS Aug 27 '15 at 05:51
4

CSV is not regular. Unless your regex language has sufficient power to handle the stateful nature of csv parsing (unlikely, the MS one does not) then any pure regex solution is a list of bugs waiting to happen as you hit a new input source that isn't quite handled by the last regex.

CSV reading is not that complex to write as a state machine since the grammar is simple but even so you must consider: quoted quotes, commas within quotes, new lines within quotes, empty fields.

As such you should probably just use someone else's CSV parser. I recommend CSVReader for .Net

ShuggyCoUk
  • 36,004
  • 6
  • 77
  • 101
2

Function:

    private List<string> ParseDelimitedString (string arguments, char delim = ',')
    {
        bool inQuotes = false;
        bool inNonQuotes = false; //used to trim leading WhiteSpace

        List<string> strings = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in arguments)
        {
            if (c == '\'' || c == '"')
            {
                if (!inQuotes)
                    inQuotes = true;
                else
                    inQuotes = false;
            }else if (c == delim)
            {
                if (!inQuotes)
                {
                    strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if ( !char.IsWhiteSpace(c) && !inQuotes && !inNonQuotes)  
            {
                if (!inNonQuotes) inNonQuotes = true;
                sb.Append(c);
            }
        }
        strings.Add(sb.Replace("'", string.Empty).Replace("\"", string.Empty).ToString());


        return strings;
    }

Usage

    string myString = "cat,dog,\"0 = OFF, 1 = ON\",lion,tiger,'R = red, G = green, B = blue',bear,         text";
    List<string> strings = ParseDelimitedString(myString);

    foreach( string s in strings )
            Console.WriteLine( s );

Output:

cat
dog
0 = OFF, 1 = ON
lion
tiger
R = red, G = green, B = blue
bear
text
1

I found a few bugs in that version, for example, a non-quoted string that has a single quote in the value.

And I agree use the FileHelper library when you can, however that library requires you know what your data will look like... I need a generic parser.

So I've updated the code to the following and thought I'd share...

    static public List<string> ParseDelimitedString(string value, char delimiter)
    {
        bool inQuotes = false;
        bool inNonQuotes = false;
        bool secondQuote = false;
        char curQuote = '\0';

        List<string> results = new List<string>();

        StringBuilder sb = new StringBuilder();
        foreach (char c in value)
        {
            if (inNonQuotes)
            {
                // then quotes are just characters
                if (c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inNonQuotes = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (inQuotes)
            {
                // then quotes need to be double escaped
                if ((c == '\'' && c == curQuote) || (c == '"' && c == curQuote))
                {
                    if (secondQuote)
                    {
                        secondQuote = false;
                        sb.Append(c);
                    }
                    else
                        secondQuote = true;
                }
                else if (secondQuote && c == delimiter)
                {
                    results.Add(sb.ToString());
                    sb.Remove(0, sb.Length);
                    inQuotes = false;
                }
                else if (!secondQuote)
                {
                    sb.Append(c);
                }
                else
                {
                    // bad,as,"user entered something like"this,poorly escaped,value
                    // just ignore until second delimiter found
                }
            }
            else
            {
                // not yet parsing a field
                if (c == '\'' || c == '"')
                {
                    curQuote = c;
                    inQuotes = true;
                    inNonQuotes = false;
                    secondQuote = false;
                }
                else if (c == delimiter)
                {
                    // blank field
                    inQuotes = false;
                    inNonQuotes = false;
                    results.Add(string.Empty);
                }
                else
                {
                    inQuotes = false;
                    inNonQuotes = true;
                    sb.Append(c);
                }
            }
        }

        if (inQuotes || inNonQuotes)
            results.Add(sb.ToString());

        return results;
    }
  • Found bugs in *which* version? Answers don't appear in a fixed order, so if you want to reference another answer, it's best to link to it directly. The `share` button below the answer provides the appropriate URL. – Alan Moore Jul 21 '16 at 20:15
0

since this question: Regex to to parse csv with nested quotes

reports here and is much more generic, and since a RegEx is not really the proper way to solve this problem (i.e. I have had many issues with catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html)

here is a simple parser implementation in Python as well

def csv_to_array(string):
    stack = []
    match = []
    matches = []

    for c in string:
        # do we have a quote or double quote?
        if c == "\"":
            # is it a closing match?
            if len(stack) > 0 and stack[-1] == c:
                stack.pop()
            else:
                stack.append(c)
        elif (c == "," and len(stack) == 0) or (c == "\n"):
            matches.append("".join(match))
            match = []
        else:
            match.append(c)

    return matches
MrE
  • 19,584
  • 12
  • 87
  • 105