66

I have the following comma-separated string that I need to split. The problem is that some of the content is within quotes and contains commas that shouldn't be used in the split.

String:

111,222,"33,44,55",666,"77,88","99"

I want the output:

111  
222  
33,44,55  
666  
77,88  
99  

I have tried this:

(?:,?)((?<=")[^"]+(?=")|[^",]+)   

But it reads the comma between "77,88","99" as a hit and I get the following output:

111  
222  
33,44,55  
666  
77,88  
,  
99  
Dale K
Peter Norlén

16 Answers

101

Depending on your needs, you may not be able to use a CSV parser, and may in fact want to reinvent the wheel!

You can do so with a simple regex:

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

This will do the following:

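(?:^|,) = A non-capturing group that matches either the beginning of the string or a comma.

(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group that selects between two alternatives:

  1. a quoted field, which may contain commas and doubled quotes ("")
  2. an unquoted field, i.e. any run of characters up to the next comma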

This should give you the output you are looking for.

Example code in C#

 static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

public static string[] SplitCSV(string input)
{
  List<string> list = new List<string>();
  foreach (Match match in csvSplit.Matches(input))
  {
    // Group 1 is the field itself, without the leading comma that
    // match.Value includes. Quoted fields keep their surrounding quotes.
    list.Add(match.Groups[1].Value);
  }

  return list.ToArray();
}

private void button1_Click(object sender, RoutedEventArgs e)
{
    // Print one field per line; writing the array directly would only print its type name.
    Console.WriteLine(string.Join(Environment.NewLine, SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\"")));
}
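
The values matched this way still carry their surrounding quotation marks. A comment below suggests chaining TrimStart('"') and TrimEnd('"'), but that also eats quotes that belong to the data when a field ends in a doubled quote. A minimal sketch of a stricter cleanup step, assuming doubled quotes ("") are the only escape in use (the class and method names here are hypothetical, not part of the answer):

```csharp
public static class CsvFieldCleanup
{
    // Hypothetical helper: strips exactly one pair of enclosing quotes,
    // then turns each doubled quote ("") back into a single quote.
    public static string Unquote(string field)
    {
        bool quoted = field.Length >= 2
            && field[0] == '"'
            && field[field.Length - 1] == '"';
        if (!quoted)
        {
            return field; // unquoted fields are returned unchanged
        }

        string inner = field.Substring(1, field.Length - 2);
        return inner.Replace("\"\"", "\"");
    }
}
```

Applied to each captured value, this turns "33,44,55" (with quotes) into 33,44,55 and leaves 111 untouched.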

Warning: As per @MrE's comment, if a rogue newline character appears in a badly formed CSV file and you end up with an unbalanced quote (an opening " with no closing one), the regex above suffers catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) and your system will likely hang or crash (as our production system did). This can easily be replicated in Visual Studio, which, as I've discovered, it will also crash. A simple try/catch will not trap this issue either.

You should use:

(?:^|,)(\"(?:[^\"])*\"|[^,]*)

instead. Note that this simpler pattern no longer accepts doubled quotes ("") inside a quoted field.
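
As an extra safeguard against runaway matching, .NET 4.5 and later let you give a Regex a match timeout, which raises RegexMatchTimeoutException instead of hanging. The following is only a sketch: the class name, the method name, and the 250 ms budget are assumptions chosen for illustration, and the pattern is the simplified one quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SafeCsvSplitter
{
    // The third constructor argument is the match timeout.
    // 250 ms is an assumed budget; tune it for your own data.
    private static readonly Regex CsvSplit = new Regex(
        "(?:^|,)(\"(?:[^\"])*\"|[^,]*)",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250));

    // Returns the raw fields (quoted fields keep their quotes),
    // or null if the regex engine exceeded the time budget.
    public static List<string> TrySplit(string input)
    {
        var fields = new List<string>();
        try
        {
            // Matches() is evaluated lazily, so the timeout surfaces
            // while enumerating, inside this try block.
            foreach (Match match in CsvSplit.Matches(input))
            {
                fields.Add(match.Groups[1].Value);
            }
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
        return fields;
    }
}
```

A caller can then treat a null result as a malformed line and log or skip it rather than letting the process stall.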

Community
jimplode
  • How about some sample code? As it is, this answer makes no sense. – Alan Moore Sep 23 '10 at 09:21
  • 6
    Without a code example it made perfect sense, and no example was given as I have no idea for what language he is writing. I have now included a sample in C# – jimplode Sep 23 '10 at 09:46
  • Thanks! This is helping me moving on. I did mention the coding language in the tags, but I will write it more clearly in the text next time. – Peter Norlén Sep 23 '10 at 14:09
  • I used your solution with some trimming of the matches for commas and quotes, and that produces the result I need. Thanks a lot! I was close to the solution but you gave me the last push :o) Ok, uhm... where do I mark it as the correct answer? A little new to this site... – Peter Norlén Sep 23 '10 at 16:38
  • Aha, think I found it... – Peter Norlén Sep 23 '10 at 16:40
  • Glad I could help!! :) I like Regex!! – jimplode Sep 23 '10 at 16:56
  • 4
    Hmm, not matching for me...the comma inside the double-quotes is still used to "split" my string. – ganders Sep 23 '14 at 17:19
  • The regex works, just use the example above. If this is not working under your implementation then I suggest something is wrong with it, maybe if you give an example of your code it would be easier to help you. – jimplode Sep 24 '14 at 11:08
  • 1
    The above Regex fails with the following (assuming comma separated): `whatever,"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A""",another column` as it gets split into `whatever`, `""`,`"1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A"`,`""`,`another column`. Any thoughts for an update to the Regex? – Free Coder 24 Jan 23 '15 at 00:44
  • You need to escape the quote, as this is the character that we are using to separate the values. – jimplode Jan 26 '15 at 11:32
  • And that's how you save a lot of time :) – Polo Oct 27 '16 at 10:38
  • If you have to deal with files that contain line breaks inside column values, or just want an easy and robust solution for any CSV file, try _Microsoft.VisualBasic.FileIO.TextFieldParser_ as described in http://stackoverflow.com/a/3508572 – Daniel Calliess Jan 05 '17 at 16:24
  • @jimplode No, it IS escaped by doubling up. – ErikE Feb 02 '17 at 22:34
  • 3
    I changed the list.Add to list.Add(curr.TrimStart(',').TrimStart('"').TrimEnd('"')); – Tim Bassett May 13 '17 at 20:29
  • 4
    this generate a catastrophic backtracking if a line starts with a " but misses the closing one (i.e. with a corrupt csv file) http://www.regular-expressions.info/catastrophic.html this `(?:^|,)(\"(?:[^\"])*\"|[^,]*)` covers it without this issue, and is simpler. – MrE Sep 21 '17 at 21:55
  • @MrE Is absolutely right - this exact code was used by a former developer in our team and it took down an entire global courier system because there was a new line character in a string that caused the double quotes to be split across a line. Causes major crashes, will take down visual studio as well I've since discovered. – Rob Jul 25 '18 at 00:37
  • @Rob this example is exactly that, you should never just copy and paste stuff and use it in a production environment without testing your scenarios, that is just bad development. – jimplode Sep 03 '18 at 11:19
  • @jimplode I totally agree - I didn't copy and paste it - I copied the regex from the code and landed on this post - so the guy before me copied and pasted it from here :-) – Rob Sep 04 '18 at 00:29
  • This regex fails whenever there is a line with a single comma in it, but is between quotes. ie: 123 "456, 789" 012 This should not split at the comma per the original request, but the regex does exactly that. – roncli Jan 17 '19 at 00:09
  • the expression doesn't parse something like: "expression with, comma",second, "last, end term, with comma". – Babak Aug 23 '19 at 15:32
  • I am using the above code to split my CSV. But it is failing for a row like ,kg,Mass. I am expecting an array like ["","kg","Mass"] but it is returning ["", "Mass"] – DevMJ Jun 01 '20 at 14:53
  • this worked like a charm thank you very much! – Cees Oct 29 '21 at 03:36
23

Fast and easy:

    public static string[] SplitCsv(string line)
    {
        List<string> result = new List<string>();
        StringBuilder currentStr = new StringBuilder("");
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++) // For each character
        {
            if (line[i] == '\"') // Quotes are closing or opening
                inQuotes = !inQuotes;
            else if (line[i] == ',') // Comma
            {
                if (!inQuotes) // If not in quotes, end of current string, add it to result
                {
                    result.Add(currentStr.ToString());
                    currentStr.Clear();
                }
                else
                    currentStr.Append(line[i]); // If in quotes, just add it 
            }
            else // Add any other character to current string
                currentStr.Append(line[i]); 
        }
        result.Add(currentStr.ToString());
        return result.ToArray(); // Return array of all strings
    }

With this string as input :

 111,222,"33,44,55",666,"77,88","99"

It will return :

111  
222  
33,44,55  
666  
77,88  
99  
Antoine
  • It would be most useful if you could explain the main parts of your approach in your code. – Yannis Dec 15 '17 at 14:50
  • Ok I added comments and example. Also optimized it using StringBuilder. – Antoine Dec 18 '17 at 20:11
  • Nice work. It helped me a lot. Thanks. – phil1630 Feb 06 '20 at 03:39
  • Love this answer. The problem is similar to how to process expression (e.g. mathematical with parentheses and operators) and this concept solves it in a straightforward, predictable and readable manner. Not like the regex solutions. – Mars Mar 12 '20 at 09:43
  • CAUTION: This solution may have a problem! If one of the columns in your line has a "real" quotation mark in the middle of its text (by real i mean, it belongs to the text and is not intended to show data semantics), it does not work properly. For example, if you have the following line: "C1", "C2withA"Text", "C3" then the algorithm above will NOT give you the 3 strings C1 C2withA"Text C3 Yet this example is weird and should be avoided (maybe by data cleaning rules?), this may happen, and Excel indeed gets the "correct" result C1, C2..., C3. – Timbu42 Nov 18 '21 at 18:16
17

I really like jimplode's answer, but I think a version with yield return is a bit more useful, so here it is:

public IEnumerable<string> SplitCSV(string input)
{
    Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

    foreach (Match match in csvSplit.Matches(input))
    {
        yield return match.Value.TrimStart(',');
    }
}

Maybe it's even more useful to have it like an extension method:

public static class StringHelper
{
    public static IEnumerable<string> SplitCSV(this string input)
    {
        Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

        foreach (Match match in csvSplit.Matches(input))
        {
            yield return match.Value.TrimStart(',');
        }
    }
}
qqbenq
7

This regular expression works without the need to loop through values and TrimStart(','), like in the accepted answer:

((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))

Here is the implementation in C#:

string values = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";

MatchCollection matches = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))").Matches(values);

foreach (var match in matches)
{
    Console.WriteLine(match);
}

Outputs

111  
222  
33,44,55  
666  
77,88  
99  
Chris Schiffhauer
  • The above Regex fails with the following (assuming comma separated): `whatever,"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A""",another column` as it gets split into `whatever`, `""`,`"1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A"`,`""`,`another column`. Any thoughts for an update to the Regex? – Free Coder 24 Jan 23 '15 at 00:50
  • @FreeCoder24 http://omegacoder.com/?p=542 – OzBob May 06 '16 at 05:02
3

None of these answers work when the string has a comma inside quotes, as in "value, 1", or escaped double-quotes, as in "value ""1""", which are valid CSV that should be parsed as value, 1 and value "1", respectively.

This will also work with the tab-delimited format if you pass in a tab instead of a comma as your delimiter.

public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
    var currentString = new StringBuilder();
    var inQuotes = false;
    var quoteIsEscaped = false; //Store when a quote has been escaped.
    row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
    foreach (var character in row.Select((val, index) => new {val, index}))
    {
        if (character.val == delimiter) //We hit a delimiter character...
        {
            if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
            {
                yield return currentString.ToString();
                currentString.Clear();
            }
            else
            {
                currentString.Append(character.val);
            }
        } else {
            if (character.val != ' ')
            {
                if(character.val == '"') //If we've hit a quote character...
                {
                    if(character.val == '\"' && inQuotes) //Does it appear to be a closing quote?
                    {
                        if (row[character.index + 1] == character.val) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                        {
                            quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                        }
                        else if (quoteIsEscaped)
                        {
                            quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                            currentString.Append(character.val);
                        }
                        else
                        {
                            inQuotes = false;
                        }
                    }
                    else
                    {
                        if (!inQuotes)
                        {
                            inQuotes = true;
                        }
                        else
                        {
                            currentString.Append(character.val); //...It's a quote inside a quote.
                        }
                    }
                }
                else
                {
                    currentString.Append(character.val);
                }
            }
            else
            {
                if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
                {
                    currentString.Append(character.val);
                }
            }
        }
    }
}
Community
Chad Hedgcock
3

With minor updates to the function provided by "Chad Hedgcock".

Updates are on:

Line 26: character.val == '\"' - This can never be true due to the check made on Line 24. i.e. character.val == '"'

Line 28: if (row[character.index + 1] == character.val) added !quoteIsEscaped to escape 3 consecutive quotes.

public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
var currentString = new StringBuilder();
var inQuotes = false;
var quoteIsEscaped = false; //Store when a quote has been escaped.
row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
foreach (var character in row.Select((val, index) => new {val, index}))
{
    if (character.val == delimiter) //We hit a delimiter character...
    {
        if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
        {
            //Console.WriteLine(currentString);
            yield return currentString.ToString();
            currentString.Clear();
        }
        else
        {
            currentString.Append(character.val);
        }
    } else {
        if (character.val != ' ')
        {
            if(character.val == '"') //If we've hit a quote character...
            {
                if(character.val == '"' && inQuotes) //Does it appear to be a closing quote?
                {
                    if (row[character.index + 1] == character.val && !quoteIsEscaped) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                    {
                        quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                    }
                    else if (quoteIsEscaped)
                    {
                        quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                        currentString.Append(character.val);
                    }
                    else
                    {
                        inQuotes = false;
                    }
                }
                else
                {
                    if (!inQuotes)
                    {
                        inQuotes = true;
                    }
                    else
                    {
                        currentString.Append(character.val); //...It's a quote inside a quote.
                    }
                }
            }
            else
            {
                currentString.Append(character.val);
            }
        }
        else
        {
            if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
            {
                currentString.Append(character.val);
            }
        }
    }
}

}

Shiva
2

For Jay's answer, if you use a 2nd boolean then you can have nested double-quotes inside single-quotes and vice-versa.

    private string[] splitString(string stringToSplit)
{
    char[] characters = stringToSplit.ToCharArray();
    List<string> returnValueList = new List<string>();
    string tempString = "";
    bool blockUntilEndQuote = false;
    bool blockUntilEndQuote2 = false;
    int characterCount = 0;
    foreach (char character in characters)
    {
        characterCount = characterCount + 1;

        if (character == '"' && !blockUntilEndQuote2)
        {
            if (blockUntilEndQuote == false)
            {
                blockUntilEndQuote = true;
            }
            else if (blockUntilEndQuote == true)
            {
                blockUntilEndQuote = false;
            }
        }
        if (character == '\'' && !blockUntilEndQuote)
        {
            if (blockUntilEndQuote2 == false)
            {
                blockUntilEndQuote2 = true;
            }
            else if (blockUntilEndQuote2 == true)
            {
                blockUntilEndQuote2 = false;
            }
        }

        if (character != ',')
        {
            tempString = tempString + character;
        }
        else if (character == ',' && (blockUntilEndQuote == true || blockUntilEndQuote2 == true))
        {
            tempString = tempString + character;
        }
        else
        {
            returnValueList.Add(tempString);
            tempString = "";
        }

        if (characterCount == characters.Length)
        {
            returnValueList.Add(tempString);
            tempString = "";
        }
    }

    string[] returnValue = returnValueList.ToArray();
    return returnValue;
}
And Wan
2

The original version

Currently I use the following regex:

public static Regex regexCSVSplit = new Regex(@"(?x:(
      (?<FULL>
        (^|[,;\t\r\n])\s*
        ( (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>) |
          (?<QUODAT> (?<DAT> [^""',;\s\r\n]* )) )
        (?=\s*([,;\t\r\n]|$))
      ) |
      (?<FULL>
        (^|[\s\t\r\n])
        ( (?<QUODAT> (?<QUO>[""'])(?<DAT> [^""',;\s\t\r\n]* )\k<QUO>) |
          (?<QUODAT> (?<DAT> [^""',;\s\t\r\n]* )) )
        (?=[,;\s\t\r\n]|$)
      )
    ))", RegexOptions.Compiled);

This solution can also handle fairly chaotic input.

This is how to feed the result into an array:

var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
      Select(x => x.Groups["DAT"].Value).ToArray();

See this example in action HERE

Note: The regular expression contains two <FULL> blocks, and each of them contains two <QUODAT> blocks separated by "or" (|). Depending on your task, you may need only one of them.

Note: This regular expression yields a single string array and works on a single line, with or without <carriage return> and/or <line feed>.

Simplified version

The following regular expression will already cover many complex cases:

public static Regex regexCSVSplit = new Regex(@"(?x:(
      (?<FULL>
        (^|[,;\t\r\n])\s*
        (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>)
        (?=\s*([,;\t\r\n]|$))
      )
    ))", RegexOptions.Compiled);

See this example in action: HERE

It can process complex, simple, and empty items too.

This is how to feed the result into an array:

var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
      Select(x => x.Groups["DAT"].Value).ToArray();

The main rule here is that every item may contain anything except the <quotation mark><separators><comma> sequence, AND each item shall begin and end with the same <quotation mark>.

  • <quotation mark>: <">, <'>
  • <comma>: <,>, <;>, <tab>, <carriage return>, <line feed>

Edit notes: I added some more explanation to make it easier to understand and replaced the text "CO" with "QUO".

minus one
1

I know I'm a bit late to this, but for anyone searching, here is how I did what you are asking about in C#:

private string[] splitString(string stringToSplit)
    {
        char[] characters = stringToSplit.ToCharArray();
        List<string> returnValueList = new List<string>();
        string tempString = "";
        bool blockUntilEndQuote = false;
        int characterCount = 0;
        foreach (char character in characters)
        {
            characterCount = characterCount + 1;

            if (character == '"')
            {
                if (blockUntilEndQuote == false)
                {
                    blockUntilEndQuote = true;
                }
                else if (blockUntilEndQuote == true)
                {
                    blockUntilEndQuote = false;
                }
            }

            if (character != ',')
            {
                tempString = tempString + character;
            }
            else if (character == ',' && blockUntilEndQuote == true)
            {
                tempString = tempString + character;
            }
            else
            {
                returnValueList.Add(tempString);
                tempString = "";
            }

            if (characterCount == characters.Length)
            {
                returnValueList.Add(tempString);
                tempString = "";
            }
        }

        string[] returnValue = returnValueList.ToArray();
        return returnValue;
    }
Bbb
1

Don't reinvent a CSV parser, try FileHelpers.

Marcos Meli
Darin Dimitrov
  • 8
    That solution looks pretty cumbersome for throwaway type csv parsing. According to the docs, "Next you need to define a class that maps to the record in the source/destination file.". So if I'm writing a throwaway one time program to parse a CSV file, I have to define a class that contains every field in the csv file? No thanks. – dcp Mar 13 '13 at 12:23
  • 6
    Yeah, this answer is rubbish. No explanation, use cases are limited, DOESN'T address the user's question. – user1567453 Nov 05 '15 at 04:30
1

Try this:

       string s = @"111,222,""33,44,55"",666,""77,88"",""99""";

       List<string> result = new List<string>();

       var splitted = s.Split('"').ToList<string>();
       splitted.RemoveAll(x => x == ",");
       foreach (var it in splitted)
       {
           if (it.StartsWith(",") || it.EndsWith(","))
           {
               var tmp = it.TrimEnd(',').TrimStart(',');
               result.AddRange(tmp.Split(','));
           }
           else
           {
               if(!string.IsNullOrEmpty(it)) result.Add(it);
           }
      }
       //Results:

       foreach (var it in result)
       {
           Console.WriteLine(it);
       }
nan
  • 1
    Your function can't handle strings within quotes starting with a comma. string s = @"AAA,"",BBB,CCC"""; Above string should result in two tokens, but your function output three tokens. – Wallstreet Programmer Feb 10 '14 at 19:08
1

I needed something a little more robust, so I took from here and created this... This solution is a little less elegant and a little more verbose, but in my testing (with a 1,000,000 row sample), I found this to be 2 to 3 times faster. Plus it handles non-escaped, embedded quotes. I used string delimiter and qualifiers instead of chars because of the requirements of my solution. I found it more difficult than I expected to find a good, generic CSV parser so I hope this parsing algorithm can help someone.

    public static string[] SplitRow(string record, string delimiter, string qualifier, bool trimData)
    {
        // In-Line for example, but I implemented as string extender in production code
        Func <string, int, int> IndexOfNextNonWhiteSpaceChar = delegate (string source, int startIndex)
        {
            if (startIndex >= 0)
            {
                if (source != null)
                {
                    for (int i = startIndex; i < source.Length; i++)
                    {
                        if (!char.IsWhiteSpace(source[i]))
                        {
                            return i;
                        }
                    }
                }
            }

            return -1;
        };

        var results = new List<string>();
        var result = new StringBuilder();
        var inQualifier = false;
        var inField = false;

        // We add new columns at the delimiter, so append one for the parser.
        var row = $"{record}{delimiter}";

        for (var idx = 0; idx < row.Length; idx++)
        {
            // A delimiter character...
            if (row[idx]== delimiter[0])
            {
                // Are we inside qualifier? If not, we've hit the end of a column value.
                if (!inQualifier)
                {
                    results.Add(trimData ? result.ToString().Trim() : result.ToString());
                    result.Clear();
                    inField = false;
                }
                else
                {
                    result.Append(row[idx]);
                }
            }

            // NOT a delimiter character...
            else
            {
                // ...Not a space character
                if (row[idx] != ' ')
                {
                    // A qualifier character...
                    if (row[idx] == qualifier[0])
                    {
                        // Qualifier is closing qualifier...
                        if (inQualifier && row[IndexOfNextNonWhiteSpaceChar(row, idx + 1)] == delimiter[0])
                        {
                            inQualifier = false;
                            continue;
                        }

                        else
                        {
                            // ...Qualifier is opening qualifier
                            if (!inQualifier)
                            {
                                inQualifier = true;
                            }

                            // ...It's a qualifier inside a qualifier.
                            else
                            {
                                inField = true;
                                result.Append(row[idx]);
                            }
                        }
                    }

                    // Not a qualifier character...
                    else
                    {
                        result.Append(row[idx]);
                        inField = true;
                    }
                }

                // ...A space character
                else
                {
                    if (inQualifier || inField)
                    {
                        result.Append(row[idx]);
                    }
                }
            }
        }

        return results.ToArray<string>();
    }

Some test code:

        //var input = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";

        var input =
            "111, 222, \"99\",\"33,44,55\" ,      \"666 \"mark of a man\"\", \" spaces \"77,88\"   \"";

        Console.WriteLine("Split with trim");
        Console.WriteLine("---------------");
        var result = SplitRow(input, ",", "\"", true);
        foreach (var r in result)
        {
            Console.WriteLine(r);
        }
        Console.WriteLine("");

        // Split 2
        Console.WriteLine("Split with no trim");
        Console.WriteLine("------------------");
        var result2 = SplitRow(input, ",", "\"", false);
        foreach (var r in result2)
        {
            Console.WriteLine(r);
        }
        Console.WriteLine("");

        // Time Trial 1
        Console.WriteLine("Experimental Process (1,000,000) iterations");
        Console.WriteLine("-------------------------------------------");
        var watch = Stopwatch.StartNew(); // requires using System.Diagnostics;
        for (var i = 0; i < 1000000; i++)
        {
            var x1 = SplitRow(input, ",", "\"", false);
        }
        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        Console.WriteLine($"Total Process Time: {elapsedMs / 1000.0:0.###} Seconds");
        Console.WriteLine("");

Results

Split with trim
---------------
111
222
99
33,44,55
666 "mark of a man"
spaces "77,88"

Split with no trim
------------------
111
222
99
33,44,55
666 "mark of a man"
 spaces "77,88"

Original Process (1,000,000) iterations
-------------------------------
Total Process Time: 7.538 Seconds

Experimental Process (1,000,000) iterations
--------------------------------------------
Total Process Time: 3.363 Seconds
Sam Jazz
  • 1
    This method is actually better. Faster than the RegEx method – Tawani Feb 07 '19 at 22:47
  • You should use char types for delimiter and qualifier, as you are only using the first character of the string now anyway. – Martin Watts Nov 25 '20 at 15:48
  • Two problems with this code: Firstly, if 'trimData' is true, then it preserves trailing whitespace, but not leading whitespace. Secondly, it treats a two consecutive quotes as a combination of a literal quote plus the terminal quote of the substring. If you want to parse a CSV file that is saved by Excel, then _three consecutive quotes_ represent a literal quote and the terminal enclosing quote. I heavily modified the code and [posted it in an answer to a duplicate question](https://stackoverflow.com/a/67814742/2998072). – Tony Pulokas Jun 03 '21 at 02:56
0

Here is my fastest implementation based upon string raw pointer manipulation:

string[] FastSplit(string sText, char? cSeparator = null, char? cQuotes = null)
    {            
        string[] oTokens;

        if (null == cSeparator)
        {
            cSeparator = DEFAULT_PARSEFIELDS_SEPARATOR;
        }

        if (null == cQuotes)
        {
            cQuotes = DEFAULT_PARSEFIELDS_QUOTE;
        }

        unsafe
        {
            fixed (char* lpText = sText)
            {
                #region Fast array estimation

                char* lpCurrent      = lpText;                    
                int   nEstimatedSize = 0;

                while (0 != *lpCurrent)
                {
                    if (cSeparator == *lpCurrent)
                    {
                        nEstimatedSize++;
                    }

                    lpCurrent++;
                }

                nEstimatedSize++; // Add EOL char(s)
                string[] oEstimatedTokens = new string[nEstimatedSize];

                #endregion

                #region Parsing

                char[] oBuffer = new char[sText.Length];
                int    nIndex  = 0;
                int    nTokens = 0;

                lpCurrent      = lpText;

                while (0 != *lpCurrent)
                {
                    if (cQuotes == *lpCurrent)
                    {
                        // Quotes parsing

                        lpCurrent++; // Skip quote
                        nIndex = 0;  // Reset buffer

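                        while (
                               (0       != *lpCurrent)
                            && (cQuotes != *lpCurrent)
                        )
                        {
                            oBuffer[nIndex] = *lpCurrent; // Store char

                            lpCurrent++; // Move source cursor
                            nIndex++;    // Move target cursor
                        }

                        if (0 == *lpCurrent)
                        {
                            // Unterminated quote: stop here so the cursor
                            // is never advanced past the string terminator
                            break;
                        }
                    } 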
                    else if (cSeparator == *lpCurrent)
                    {
                        // Separator char parsing

                        oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex); // Store token
                        nIndex                      = 0;                              // Skip separator and Reset buffer
                    }
                    else
                    {
                        // Content parsing

                        oBuffer[nIndex] = *lpCurrent; // Store char
                        nIndex++;                     // Move target cursor
                    }

                    lpCurrent++; // Move source cursor
                }

                // Recover pending buffer

                if (nIndex > 0)
                {
                    // Store token

                    oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex);
                }

                // Build final tokens list

                if (nTokens == nEstimatedSize)
                {
                    oTokens = oEstimatedTokens;
                }
                else
                {
                    oTokens = new string[nTokens];
                    Array.Copy(oEstimatedTokens, 0, oTokens, 0, nTokens);
                }

                #endregion
            }
        }

        // Epilogue            

        return oTokens;
    }
Antonio Petricca
  • 8,891
  • 5
  • 36
  • 74
0

I once had to do something similar and got stuck trying to do it with regular expressions. A regex cannot easily keep track of state (such as "am I inside quotes?"), which makes this pretty tricky. I ended up writing a simple little parser instead.

If you're doing CSV parsing you should just stick to using a CSV parser - don't reinvent the wheel.
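As a rough sketch of what "using a CSV parser" can look like without third-party packages, the snippet below uses `TextFieldParser` from the `Microsoft.VisualBasic.FileIO` namespace. The class name `CsvParserDemo` and the helper `SplitLine` are made up for illustration, and on .NET Framework this assumes the project references the Microsoft.VisualBasic assembly.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvParserDemo
{
    // Hypothetical helper: splits a single CSV line with TextFieldParser.
    public static string[] SplitLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;  // delimited, not fixed-width
            parser.SetDelimiters(",");                   // split on commas
            parser.HasFieldsEnclosedInQuotes = true;     // commas inside quotes stay in the field
            return parser.ReadFields();                  // null when there is nothing left to read
        }
    }
}
```

Calling `CsvParserDemo.SplitLine("111,222,\"33,44,55\",666,\"77,88\",\"99\"")` should return the six fields from the question, with the enclosing quotes removed.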

Jaco Pretorius
  • 24,380
  • 11
  • 62
  • 94
0

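Try this. It walks the line character by character, drops the double quotes, and splits on the separator only while it is outside a quoted section: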

private string[] GetCommaSeperatedWords(string sep, string line)
    {
        List<string> list = new List<string>();
        StringBuilder word = new StringBuilder();
        int doubleQuoteCount = 0;
        for (int i = 0; i < line.Length; i++)
        {
            string chr = line[i].ToString();
            if (chr == "\"")
            {
                if (doubleQuoteCount == 0)
                    doubleQuoteCount++;
                else
                    doubleQuoteCount--;

                continue;
            }
            if (chr == sep && doubleQuoteCount == 0)
            {
                list.Add(word.ToString());
                word = new StringBuilder();
                continue;
            }
            word.Append(chr);
        }

        list.Add(word.ToString());

        return list.ToArray();
    }
Krish
  • 616
  • 9
  • 19
0

This is Chad's answer rewritten with state-based logic. His answer failed for me when it came across """BRAD""" as a field. That should return "BRAD", but it just ate up all the remaining fields. When I tried to debug it, I ended up rewriting it as a state machine:

enum SplitState { s_begin, s_infield, s_inquotefield, s_foundquoteinfield };
public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
{
    var currentString = new StringBuilder();
    SplitState state = SplitState.s_begin;
    row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
    foreach (var character in row.Select((val, index) => new { val, index }))
    {
        //Console.WriteLine("character = " + character.val + " state = " + state);
        switch (state)
        {
            case SplitState.s_begin:
                if (character.val == delimiter)
                {
                    /* empty field */
                    yield return currentString.ToString();
                    currentString.Clear();
                } else if (character.val == '"')
                {
                    state = SplitState.s_inquotefield;
                } else
                {
                    currentString.Append(character.val);
                    state = SplitState.s_infield;
                }
                break;
            case SplitState.s_infield:
                if (character.val == delimiter)
                {
                    /* field with data */
                    yield return currentString.ToString();
                    state = SplitState.s_begin;
                    currentString.Clear();
                } else
                {
                    currentString.Append(character.val);
                }
                break;
            case SplitState.s_inquotefield:
                if (character.val == '"')
                {
                    // could be end of field, or escaped quote.
                    state = SplitState.s_foundquoteinfield;
                } else
                {
                    currentString.Append(character.val);
                }
                break;
            case SplitState.s_foundquoteinfield:
                if (character.val == '"')
                {
                    // found escaped quote.
                    currentString.Append(character.val);
                    state = SplitState.s_inquotefield;
                }
                else if (character.val == delimiter)
                {
                    // must have been last quote so we must find delimiter
                    yield return currentString.ToString();
                    state = SplitState.s_begin;
                    currentString.Clear();
                }
                else
                {
                    throw new Exception("Quoted field not terminated.");
                }
                break;
            default:
                throw new Exception("unknown state:" + state);
        }
    }
    //Console.WriteLine("currentstring = " + currentString.ToString());
}

This is a lot more lines of code than the other solutions, but it is easy to modify to add edge cases.
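To illustrate the kind of edge case that is easy to bolt on, here is a condensed sketch of the same four-state idea. It is not the code above: the names `CsvSketch`, `State` and `SplitRowStrict` are hypothetical, and it returns a list instead of using `yield`. The one addition is a check after the loop, so that input ending inside an open quoted field raises an error instead of the field being dropped silently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    private enum State { Begin, InField, InQuotes, QuoteInQuotes }

    // Hypothetical condensed variant of the state machine shown above.
    public static List<string> SplitRowStrict(string row, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var state = State.Begin;

        // Append a delimiter so the last field is flushed like every other field.
        foreach (char c in row + delimiter)
        {
            switch (state)
            {
                case State.Begin:
                case State.InField:
                    if (c == delimiter)
                    {
                        fields.Add(current.ToString());
                        current.Clear();
                        state = State.Begin;
                    }
                    else if (c == '"' && state == State.Begin)
                    {
                        state = State.InQuotes;        // opening quote of a quoted field
                    }
                    else
                    {
                        current.Append(c);
                        state = State.InField;
                    }
                    break;

                case State.InQuotes:
                    if (c == '"') state = State.QuoteInQuotes;  // closing quote or first half of ""
                    else current.Append(c);
                    break;

                case State.QuoteInQuotes:
                    if (c == '"')
                    {
                        current.Append(c);             // "" inside quotes is a literal quote
                        state = State.InQuotes;
                    }
                    else if (c == delimiter)
                    {
                        fields.Add(current.ToString());
                        current.Clear();
                        state = State.Begin;
                    }
                    else
                    {
                        throw new FormatException("Unexpected character after closing quote.");
                    }
                    break;
            }
        }

        // Added edge case: the input ended while a quoted field was still open.
        if (state == State.InQuotes)
            throw new FormatException("Quoted field not terminated.");

        return fields;
    }
}
```

With this sketch, the string from the question splits into six fields, `"""BRAD"""` comes back as `"BRAD"` (quotes included), and an input such as `"abc` throws a `FormatException`.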

Be Kind To New Users
  • 9,672
  • 13
  • 78
  • 125