
I have a CSV file, with the following type of data:

0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,

and I want the following output:

0
VT,C
0
0
C,VT
0
0
VT,H
0

In other words, I want to split each line on commas while ignoring any comma inside quote marks. At the moment I'm using the following RegEx:

("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"

However, this gives me the following result:

0
VT
C
0
0
C
VT
0
0
VT
H
0

This shows the RegEx is not handling the quote marks properly. Can anyone suggest some alterations that might help?

Matt_Johndon

4 Answers


For CSV parsing, people usually use a dedicated library suited to the programming language their application is written in.

That said, if you want to use a regular expression for a really loose(!) parse, you could try something like this:

'(?<value>[^']*?)'

It matches anything between single quotes and, assuming the CSV file is well formed, it will not miss a quoted field. It doesn't accept embedded quotes, but it gets the job done, and it's what I use when I need a result really quickly. Please don't consider it a complete solution to your problem: it only works in special conditions, when the requirements are what you described and the input is well formed.

[EDIT]

I read your question again and noticed that you also want to include unquoted fields. In that case my expression will not work at all. If you think hard about the problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling separator commas from commas that belong to a quoted value.

Another expression that models such a solution might be:

('[^']+'|[^,]+),?

It matches both quoted and unquoted fields, although I'm not sure whether it requires the CSV to adhere to strict conditions. As far as I can tell, it is much safer than a split strategy: you just need to collect all the matches and append each matched value plus \r\n to your target string.
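As a rough illustration of the "collect all matches" approach, here is a minimal sketch that applies the expression above to one row and strips the surrounding single quotes. The class name `QuotedCsvSketch` and the method name `SplitFields` are hypothetical, chosen only for this example. It assumes well-formed input with no embedded quotes, and it silently skips empty fields because both alternatives require at least one character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class QuotedCsvSketch
{
    // Group 1 is either a single-quoted field (commas allowed inside)
    // or a run of non-comma characters; the separator comma is optional.
    private static readonly Regex FieldPattern = new Regex(@"('[^']+'|[^,]+),?");

    public static List<string> SplitFields(string row)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(row))
        {
            // Take the captured field and drop single quotes at either end.
            fields.Add(m.Groups[1].Value.Trim('\''));
        }
        return fields;
    }
}
```

For the first sample row, `string.Join("\r\n", QuotedCsvSketch.SplitFields("0,'VT,C',0,"))` yields the three lines `0`, `VT,C` and `0`.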

Diego D

This regex relies on the fact that there is a digit before and after each quoted value, so every separator comma is adjacent to a digit:

Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");

You can test it out on RegexStorm

Pierluc SS

I have managed to get the following method to read the file as required:

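// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public List<string> SplitCSV(string input, List<string> line)
{
    // A field is a run of non-comma, non-quote characters, optionally
    // containing a single-quoted section, followed by a comma or end of input.
    Regex csvSplit = new Regex("(([^,'])*('.*')*([^,'])*)(,|$)", RegexOptions.Compiled);

    foreach (Match match in csvSplit.Matches(input))
    {
        // Skip the zero-length match the pattern produces at the end of the input.
        if (match.Length == 0) continue;
        // Each match ends with its separator comma, so trim from the end.
        line.Add(match.Value.TrimEnd(','));
    }
    return line;
}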

Thanks for everyone's help, though.

Community
Matt_Johndon
  • Actually that doesn't compile, because you should add the value to the `hot` List, and you should use `TrimEnd()`, not `TrimStart()`. It uses the strategy I suggested but with a different regular expression. Your expression doesn't handle cases that differ from your samples above, which is why I wrote a more general one. Anyway, you seem to have asked a question and then left the discussion to go your own way. I hope your solution does not fail on further cases. – Diego D Aug 03 '12 at 14:53
  • What is input supposed to represent? – Kala J May 19 '14 at 20:02

foreach (Match m in Regex.Matches(s, "(('.*?')|[0-9])"))
Anirudha