57

I got quite a lot of strings (segments of SQL code, actually) with the following format:

('ABCDEFG', 123542, 'XYZ 99,9')

and i need to split this string, using C#, in order to get:

  • 'ABCDEFG'
  • 123542
  • 'XYZ 99,9'

I was originally using a simple Split(','), but since that comma inside the last parameter is causing havoc in the output i need to use Regex to get it. The problem is that i'm still quite noobish in regular expressions and i can't seem to crack the pattern mainly because inside that string both numerical and alpha-numerical parameters may exist at any time...

What could i use to split that string according to every comma outside the quotes? Cheers

João Pereira
  • 3,545
  • 7
  • 44
  • 53
  • 1
    possible duplicate of [Java: splitting a comma-separated string but ignoring commas in quotes](http://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes) – Bart Kiers Jun 30 '10 at 09:42
  • 11
    Except it's in C#.............. – João Pereira Jun 30 '10 at 09:47
  • Not to mention SO's search didn't show that thread either. – João Pereira Jun 30 '10 at 09:48
  • 2
    Sure, but the regex is practically the same and it's trivial to convert into C#. I just found it worth mentioning since the other thread contains a bit more explanation on the regex. – Bart Kiers Jun 30 '10 at 10:18

7 Answers7

89

You could split on all commas, that do have an even number of quotes following them , using the following Regex to find them:

",(?=(?:[^']*'[^']*')*[^']*$)"

You'd use it like

var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");
Jens
  • 25,229
  • 9
  • 75
  • 117
  • @Jens: Lools like it fails when you have a single quote inside the string value: e.g. 'op','ID','script','Mike's','Content-Length' – Misha Narinsky May 05 '12 at 22:04
  • @Michael: Yes, it fails in that case. But I think most other parsers would fail as well in this case. You'd need to escape the quote (and correct the regex to respoect that). – Jens May 06 '12 at 10:08
  • 1
    @MichaelNarinsky Single quotes within a quoted string isn't valid to begin with. The value `'op','ID','script','Mike's','Content-Length' ` is invalid and should be `'op','ID','script','Mike''s','Content-Length'` which I believe still works. (According to SQL string escaping) – Robert McKee Apr 29 '15 at 16:06
  • 1
    I wasn't able to capture empty columns with this. However, if you checkout my answer here, you'll see what I used instead. – Ristogod Feb 25 '21 at 18:23
39
//this regular expression splits string on the separator character NOT inside double quotes. 
//separatorChar can be any character like comma or semicolon etc. 
//it also allows single quotes inside the string value: e.g. "Mike's Kitchen","Jane's Room"
Regex regx = new Regex(separatorChar + "(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"); 
string[] line = regx.Split(string to split);
user2661454
  • 391
  • 4
  • 3
11

I had a problem where it wasn't capturing empty columns. I modified it as such to get empty string results

var results = Regex.Split(source, "[,]{1}(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
Ristogod
  • 905
  • 4
  • 14
  • 29
7

although I too like a challenge some of the time, but this actually isn't one. please read this article http://secretgeek.net/csv_trouble.asp and then go on and use http://www.filehelpers.com/

[Edit1, 3]: or maybe this article can help too (the link only shows some VB.Net sample code but still, you can use it with C# too!): http://msdn.microsoft.com/en-us/library/cakac7e6.aspx

I've tried to do the sample for C# (add reference to Microsoft.VisualBasic to your project)

using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            TextReader reader = new StringReader("('ABCDEFG', 123542, 'XYZ 99,9')");
            TextFieldParser fieldParser = new TextFieldParser(reader);

            fieldParser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
            fieldParser.SetDelimiters(",");

            String[] currentRow; 

            while (!fieldParser.EndOfData)
            {
                try
                {
                     currentRow = fieldParser.ReadFields();

                     foreach(String currentField in currentRow)
                     {
                        Console.WriteLine(currentField);                        
                     }
                }
                catch (MalformedLineException e)
                {
                    Console.WriteLine("Line {0} is not valid and will be skipped.", e);
               }

            } 

        }
    }
}

[Edit2]: found another one which could be of help here: http://www.codeproject.com/KB/database/CsvReader.aspx

-- reinhard

macf00bar
  • 673
  • 1
  • 14
  • 32
  • This isn't for CSVs, although Filehelpers looks interesting. Thanks – João Pereira Jun 30 '10 at 11:45
  • although your sample string is not a CSV file you could still look at it as one row from a CSV. I just wanted to point out, as many others have to people trying to use RegEx for parsing HTML and RegEx is definitely not good for that, that also for parsing CVS like strings it's better to use a parser/helper/whatever instead of plain RegEx. – macf00bar Jun 30 '10 at 11:52
  • lol this is a C# project and i was looking for one solution in that same language, sorry but i'm not going to use VB (ffs). Thanks again – João Pereira Jul 01 '10 at 09:47
  • 3
    @Hal: just because the sample code is VB doesn't mean you can't use it in C# (add a reference to Microsoft.VisualBasic and add using Microsoft.VisualBasic.FileIO; and you're fine to use TextFieldParser) – macf00bar Jul 01 '10 at 14:40
  • 1
    @pastacool I've been having issues with a CSV for days and just came across this answer. It worked fantastic in my situation, great work! – DeeDub Dec 04 '12 at 16:52
  • 1
    @Hal Your values are separated by commans so why isn't it CSV what stands for Comma Separated Values? Why would be a problem to use an assembly written in VB.Net? On the other hand VisualBasic namespace doesn't necessarly mean the assembly is compiled from Visual Basic, could be any other .Net language. – user3285954 Jun 30 '14 at 11:46
  • Looking back, it was a pretty stupid comment I said back there. Sorry about that. I wanted to do everything in C# because I don't really like VB. – João Pereira Jun 30 '14 at 13:23
0

Try (hacked from Jens') in the split method:

",(?:.*?'[^']*?')"

or just add question marks after Jens' *'s, that makes it lazy rather than greedy.

fbstj
  • 1,684
  • 1
  • 16
  • 22
  • 1
    You seem to be missing the point of Jens's regex. The part after the comma has to be a lookahead, and the lookahead has to account for all the remaining quotes. It has to be anchored with `$`, so non-greedy quantifiers are pointless, and it can't use `.` because that will make it lose count of the quotes. – Alan Moore Jul 02 '10 at 06:32
0

... or you could have installed NuGet package LumenWorks CsvReader and done something like below where I read a csv file which has content like for example

"hello","how","hello, how are you"
"hi","hello","greetings"
...

and process it like this

public static void ProcessCsv()
        {
            var filename = @"your_file_path\filename.csv";
            DataTable dt = new DataTable("MyTable");

            List<string> product_codes = new List<string>();
            using (CsvReader csv = new CsvReader(new StreamReader(filename), true))
            {
                int fieldCount = csv.FieldCount;

                string[] headers = csv.GetFieldHeaders();
                for (int i = 0; i < headers.Length; i++)
                {
                     dt.Columns.Add(headers[i], typeof(string));
                }

                while (csv.ReadNextRecord())
                {
                    DataRow dr = dt.NewRow();
                    for (int i = 0; i < fieldCount; i++)
                    {
                        product_codes.Add(csv[i]);
                        dr[i] = csv[i];
                    }
                    dt.Rows.Add(dr);
                }
            }
        }
pixel
  • 9,653
  • 16
  • 82
  • 149
-2

The accepted answer does not work for me (can put in and test at Regexr-dot-com and see that does not work). So I had to read the lines into an array of lines. Use (C#) Regex.Matches to get an array of any strings found between escaped quotes (your in-field commas should be in fields wrapped in quotes), and replace commas with || before splitting each line into columns/fields. After splitting each line, I looped each line and column to replace || with commas.

        private static IEnumerable<string[]> ReadCsv(string fileName, char delimiter = ';')
    {
        string[] lines = File.ReadAllLines(fileName, Encoding.ASCII);
        // Before splitting on comma for a field array, we have to replace commas witin the fields
        for(int l = 1; l < lines.Length; l++)
        {
            //(\\")(.*?)(\\")
            MatchCollection regexGroup2 = Regex.Matches(lines[l], "(\\\")(.*?)(\\\")");
            if (regexGroup2.Count > 0)
            {
                for (int g = 0; g < regexGroup2.Count; g++)
                {
                    lines[l] = lines[l].Replace(regexGroup2[g].Value, regexGroup2[g].Value.Replace(",", "||"));
                }
            }
        }

        // Split
        IEnumerable<string[]> lines_split = lines.Select(a => a.Split(delimiter));

        // Reapply commas
        foreach(string[] row in lines_split)
        {
            for(int c = 0; c < row.Length; c++)
            {
                row[c] = row[c].Replace("||", ",");
            }
        }

        return (lines_split);
    }