I have a text file as follows:

"0","Column","column2","Column3"

I have managed to split the data into the following:

"0"
"Column"
"Column2"
"Column3"

using ,(?=(?:[^']*'[^']*')*[^']*$). Now I want to remove the quotes. I have tested the expression [^\s"']+|"([^"]*)"|\'([^\']*) in an online regex tester, and it gives the output I'm looking for. However, I get a syntax error when I use the expression in my code:

String[] columns = Regex.Split(dataLine, "[^\s"']+|"([^"]*)"|\'([^\']*)");

Syntax error ',' expected

I've tried escaping characters, but to no avail. Am I missing something?

Any help would be greatly appreciated!

Thanks.

Cal

3 Answers

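C# treats the backslash as an escape character in a regular string literal. Try a verbatim string instead, doubling every double quote inside it:

String[] columns = Regex.Split(dataLine, @"[^\s""']+|""([^""]*)""|\'([^\']*)");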
Russell
  • This is still a syntax error for the same reason. You need to use `""` to represent a single `"` inside of a verbatim string literal (using the `@` syntax). – John Oct 08 '18 at 22:32
  • Thanks - good pickup; fixed – Russell Oct 08 '18 at 22:36

The problem is the double quotes inside the regex: the compiler treats them as the end of the string. You must escape them, and also double the backslash in \s, because \s is not a valid escape sequence in a regular string literal:

"[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)"

Edit:

You can actually do everything you want with one regex, without splitting first:

@"(?<=[""])[^,]*?(?=[""])"

Here I use an @ quoted string where double quotes are doubled instead of escaped.

The regex uses a lookbehind to find a double quote, then lazily matches any character except a comma (',') zero or more times, then uses a lookahead to find the closing double quote.

How to use:

string test = @"""0"",""Column"",""column2"",""Column3""";
Regex regex = new Regex(@"(?<=[""])[^,]*?(?=[""])");
foreach (Match match in regex.Matches(test))
{
    Console.WriteLine(match.Value);
}
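
Since the question stores the result in a String[], the same regex can also fill an array directly. This is only a minimal sketch: the ColumnExtractor class and GetColumns method are illustrative names, not part of the question, and it assumes every field is wrapped in double quotes and contains no commas.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ColumnExtractor
{
    // Hypothetical helper: returns the text between each pair of double quotes.
    public static string[] GetColumns(string dataLine)
    {
        var regex = new Regex(@"(?<=[""])[^,]*?(?=[""])");
        // Cast<Match>() keeps this working on frameworks where
        // MatchCollection only implements the non-generic IEnumerable.
        return regex.Matches(dataLine)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string[] columns = GetColumns(@"""0"",""Column"",""column2"",""Column3""");
        Console.WriteLine(string.Join("|", columns)); // 0|Column|column2|Column3
    }
}
```
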
John
Poul Bak

You need to escape the double quotes inside your regular expression, as they're closing the string literal. Also, to avoid an "unrecognized escape sequence" compiler error, you'll need to escape the \ in \s.

Two ways to do this:

  • Escape all the characters of concern using backslashes: "[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)"
  • Use the @ syntax to denote a "verbatim" string literal. Double quotes still need to be escaped, but by doubling them ("" for every "): @"[^\s""']+|""([^""]*)""|'([^']*)"
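
As a quick sanity check that the two spellings describe the same pattern, the resulting strings can be compared at run time. This is a small sketch; the LiteralCheck class and its constant names are made up for illustration.

```csharp
using System;

public static class LiteralCheck
{
    // Regular string literal: backslashes and double quotes are escaped with '\'.
    public const string Escaped = "[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)";

    // Verbatim string literal: backslashes are literal, double quotes are doubled.
    public const string Verbatim = @"[^\s""']+|""([^""]*)""|'([^']*)";

    public static void Main()
    {
        Console.WriteLine(Escaped);
        Console.WriteLine(Verbatim);
        Console.WriteLine(Escaped == Verbatim); // True
    }
}
```

Both constants hold exactly the text that the regex engine receives, so either form can be passed to Regex.Split or Regex.Matches.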

Regardless, when I test out your new regular expression, it appears to be capturing some empty groups as well; see here: https://dotnetfiddle.net/1WQE4R
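
One possible way around the empty entries is to iterate over the matches instead of splitting, keeping only the matches where a capture group succeeded. The sketch below is an assumption about what the asker wants rather than a tested fix for their file: QuotedFields and Extract are hypothetical names, and the pattern is used exactly as given in the question. With this pattern the comma separators match the first alternative without filling any group, so they are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedFields
{
    // Hypothetical helper: returns only the text captured inside quotes.
    public static List<string> Extract(string dataLine)
    {
        const string pattern = @"[^\s""']+|""([^""]*)""|'([^']*)";
        var values = new List<string>();

        foreach (Match m in Regex.Matches(dataLine, pattern))
        {
            if (m.Groups[1].Success)
                values.Add(m.Groups[1].Value);   // double-quoted field
            else if (m.Groups[2].Success)
                values.Add(m.Groups[2].Value);   // apostrophe-quoted field
            // Otherwise the match is a bare token (here, the commas): ignore it.
        }
        return values;
    }

    public static void Main()
    {
        var fields = Extract(@"""0"",""Column"",""column2"",""Column3""");
        Console.WriteLine(string.Join("|", fields)); // 0|Column|column2|Column3
    }
}
```
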

John