
I receive lots of files, over which I have no control, that I need to split on a delimiter. However, I don't want to split when the delimiter is inside quotes. So, column1, column2, column3 becomes

column1
column2
column3

whereas column1, "column2," column3 becomes

column1
"column2," column3

This works with the following regex (in C#):

((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))

My problem arises when a line contains only one double quote (an opening or a closing one, but not both). For example, column1, column2", column3 returns

column1

column3

while it should return

column1
column2"
column3

I have found lots of related regexes, but all of them fail on this particular example.

wizard
  • It seems you are parsing a CSV file, so why not use the [built-in library](https://stackoverflow.com/a/20523165/3832970) ([another link](https://stackoverflow.com/a/3508572/3832970))? There is also an option if you just need to [parse a CSV string](https://stackoverflow.com/a/6543418/3832970), not a file. – Wiktor Stribiżew Sep 06 '21 at 12:30
  • The code uses LINQ to get the collection of rows, which are split using the regex. I cannot change that part, as it is used by many other components. – wizard Sep 06 '21 at 12:36
  • So your delimiter should be any line in the file that is not enclosed in quotes? – Niel Godfrey Pablo Ponciano Sep 06 '21 at 12:46
  • What is the code? If you use `.Matches`, you can probably just use `Regex.Matches(text, "(?:\"[^\"]*\"|[^,])+")`. – Wiktor Stribiżew Sep 06 '21 at 13:01
  • @WiktorStribiżew That almost works, but it skips columns that are empty. – wizard Sep 06 '21 at 14:29
  • @wizard Then I think `Regex.Matches(text, "(?:\"[^\"]*\"|[^,])+|(?<![^,])(?![^,])")` will work. – Wiktor Stribiżew Sep 06 '21 at 15:33

1 Answer


You can match all the fields you need using

Regex.Matches(text, "(?:\"[^\"]*\"|[^,])+|(?<![^,])(?![^,])")

See the regex demo. Details:

  • `(?:\"[^\"]*\"|[^,])+` - one or more occurrences of
    • `"[^"]*"` - a `"`, zero or more chars other than `"`, and then a `"` (if there can be `""` inside, replace it with `"[^"]*(?:""[^"]*)*"`)
    • `|` - or
    • `[^,]` - any char but `,` (this also consumes a lone `"` that has no matching closing quote)
  • `|` - or
  • `(?<![^,])(?![^,])` - a location that is either at the start of the string or immediately preceded by a comma, and is either at the end of the string or immediately followed by a comma. This is what matches empty fields.
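
To see how the pattern behaves on the inputs from the question, here is a minimal sketch of a console program. The `SplitFields` helper, the `Format` helper, and the `Trim()` call are my own assumptions and not part of your code; trimming is only there because the sample lines have a space after each comma.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    // Same pattern as above, written as a verbatim string,
    // so "" inside the literal stands for a single double quote.
    const string FieldPattern = @"(?:""[^""]*""|[^,])+|(?<![^,])(?![^,])";

    // Hypothetical helper: returns every field of a line, including empty ones.
    static string[] SplitFields(string line) =>
        Regex.Matches(line, FieldPattern)
             .Cast<Match>()
             .Select(m => m.Value.Trim()) // assumption: surrounding spaces are not wanted
             .ToArray();

    // Wraps each field in brackets so empty fields stay visible in the output.
    static string Format(string[] fields) =>
        string.Concat(fields.Select(f => "[" + f + "]"));

    static void Main()
    {
        // [column1][column2][column3]
        Console.WriteLine(Format(SplitFields("column1, column2, column3")));

        // [column1]["column2," column3]
        Console.WriteLine(Format(SplitFields("column1, \"column2,\" column3")));

        // [column1][column2"][column3]  (lone closing quote is kept)
        Console.WriteLine(Format(SplitFields("column1, column2\", column3")));

        // [a][][b]  (the empty field is matched by the lookaround alternative)
        Console.WriteLine(Format(SplitFields("a,,b")));
    }
}
```

Since you said the surrounding LINQ code cannot change, only the pattern string should need replacing there; the helpers above are just for demonstration.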
Wiktor Stribiżew