2

How can I split comma separated strings with quoted strings that can also contain commas?

Example input:

John, Doe, "Sid, Nency", Smith

Expected output:

  • John
  • Doe
  • Sid, Nency
  • Smith

Splitting on commas worked fine, but I now have a requirement that quoted strings like "Sid, Nency" are allowed. I tried using a regex to split such values. The regex ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" comes from a Java question and does not work well in my .NET code: it duplicates some strings, finds extra results, etc.

So what is the best way to split such strings?

Andrei
  • It looks like you're dealing with CSV input? If so, *please* use a CSV library - there are many good ones, and it will save you a lot of pain!! If you are not, please clarify your question to explain why a CSV library would not be suitable... – RB. Dec 20 '13 at 10:14
  • No, it is not a CSV document. It's just a string – Andrei Dec 20 '13 at 10:14
  • RB, I would be happy if you could show me how to use a CSV library to deal with this problem – Andrei Dec 20 '13 at 10:15
  • One hacky way is to first split on `"` and then split the alternate strings (in the resulting array) on `,`. – Akshat Singhal Dec 20 '13 at 10:18
  • Perl solution here (since you put the tag back): http://stackoverflow.com/questions/2459729/how-can-i-split-a-string-by-whitespace-unless-inside-of-a-single-quoted-string – RobEarl Dec 20 '13 at 10:19
  • I'm not sure how to do this, but I feel it is possible with a "Regex Balancing Group" – Sriram Sakthivel Dec 20 '13 at 10:21
  • See http://stackoverflow.com/questions/1189416/c-regular-expressions-how-to-parse-comma-separated-values-where-some-values – Roy Dictus Dec 20 '13 at 10:26

4 Answers

4

It's because of the capture group. Just turn it into a non-capture group:

",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
      ^^

In .NET, Regex.Split includes the text matched by any capture group in the returned array, which is where the duplicated and extra results come from.

ideone demo

var regexObj = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
// ForEach is a List<T> method, so materialize the sequence with ToList() first.
regexObj.Split(input).Select(s => s.Trim('"', ' ')).ToList().ForEach(Console.WriteLine);

And just trim the results.
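
For comparison, here is a minimal Java sketch of the same split, since the pattern originally came from a Java question. Java's Pattern.split never returns capture-group text, which would explain why the capturing version behaved there but not with .NET's Regex.Split. The class and method names (RegexSplitSketch, split) are made up for illustration, and the quote stripping shown is just one possible way to do the trimming step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexSplitSketch {
    // A comma counts as a separator only when an even number of quotes follows it,
    // i.e. when the comma itself is not inside a quoted section.
    private static final Pattern SEPARATOR =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static List<String> split(String input) {
        List<String> fields = new ArrayList<>();
        for (String raw : SEPARATOR.split(input)) {
            // Drop surrounding spaces, then one pair of enclosing quotes if present.
            String field = raw.trim();
            if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
                field = field.substring(1, field.length() - 1);
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        split("John, Doe, \"Sid, Nency\", Smith").forEach(System.out::println);
    }
}
```

The lookahead rescans the rest of the string for every comma, so for very long inputs a single-pass scanner may be the better fit.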

Andrei
Jerry
1

Just walk through your string once, keeping track of whether you are
inside a quoted "block" or not. If you are, don't treat a comma as a
separator; otherwise, do. It's a simple algorithm, and I would write it
myself. The first " you encounter enters a block, the next " ends that
block, and so on. A single pass through the string is therefore enough.

import java.util.ArrayList;


public class Test003 {

    public static void main(String[] args) {
        String s = "  John, , , , \" Barry, John  \" , , , , , Doe, \"Sid ,  Nency\", Smith  ";

        StringBuilder term = new StringBuilder();
        boolean inQuote = false;
        boolean inTerm = false;
        ArrayList<String> terms = new ArrayList<String>();
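        // Single pass: inQuote tracks whether we are between a pair of quotes.
        for (int i=0; i<s.length(); i++){
            char ch = s.charAt(i);
            if (ch == ' '){
                // Spaces are kept inside quotes; outside quotes a space ends the current term.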
                if (inQuote){
                    if (!inTerm) { 
                        inTerm = true;
                    }
                    term.append(ch);
                }
                else {
                    if (inTerm){
                        terms.add(term.toString());
                        term.setLength(0);
                        inTerm = false;
                    }
                }
            }else if (ch== '"'){
                term.append(ch); // keeps the quote character in the output; remove this line to drop quotes
                if (!inTerm){
                    inTerm = true;
                }
                inQuote = !inQuote;
            }else if (ch == ','){
                if (inQuote){
                    if (!inTerm){
                        inTerm = true;
                    }
                    term.append(ch);
                }else{
                    if (inTerm){
                        terms.add(term.toString());
                        term.setLength(0);
                        inTerm = false;
                    }
                }
            }else{
                if (!inTerm){
                    inTerm = true;
                }
                term.append(ch);
            }
        }

        if (inTerm){
            terms.add(term.toString());
        }

        for (String t : terms){
            System.out.println("|" + t + "|");
        }

    }
}
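
The single-pass idea can also be packaged as a small method. The following is only a sketch under a few assumptions that differ from the program above: quotes are dropped rather than kept, spaces outside quotes do not end a term, empty fields are preserved, and every field is trimmed (including whitespace just inside the quotes). The names QuoteAwareSplitter and split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    static List<String> split(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '"') {
                // Entering or leaving a quoted block; the quote itself is not kept.
                inQuote = !inQuote;
            } else if (ch == ',' && !inQuote) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                // Everything else, including commas inside quotes, belongs to the field.
                current.append(ch);
            }
        }
        // The last field has no trailing comma, so add it after the loop.
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        for (String field : split("John, Doe, \"Sid, Nency\", Smith")) {
            System.out.println("|" + field + "|");
        }
    }
}
```

An unbalanced quote is not reported here; everything after it is simply treated as one quoted field.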
peter.petrov
0

I use the following code in my CSV parser class to achieve this:

    private string[] ParseLine(string line)
    {
        List<string> results = new List<string>();
        bool inQuotes = false;
        int index = 0;
        StringBuilder currentValue = new StringBuilder(line.Length);
        while (index < line.Length)
        {
            char c = line[index];
            switch (c)
            {
                case '\"':
                    {
                        inQuotes = !inQuotes;
                        break;
                    }

                default:
                    {
                        if (c == ',' && !inQuotes)
                        {
                            results.Add(currentValue.ToString());
                            currentValue.Clear();
                        }
                        else
                            currentValue.Append(c);
                        break;
                    }
            }
            ++index;
        }

        results.Add(currentValue.ToString());
        return results.ToArray();
    }   // eo ParseLine
Moo-Juice
0

If you find the regular expression too complex you can do it like this:

string initialString = "John, Doe, \"Sid, Nency\", Smith";

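// Splitting on '"' leaves quoted text at the odd indexes and unquoted text at the even ones.
IEnumerable<string> splitted = initialString.Split('"');
// Only the unquoted (even-index) pieces are split further on commas.
splitted = splitted.SelectMany((str, index) => index % 2 == 0 ? str.Split(',') : new[] { str });
// Drop blank pieces and trim the rest; note that this also discards genuinely empty fields.
splitted = splitted.Where(str => !string.IsNullOrWhiteSpace(str)).Select(str => str.Trim());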
lightbricko