-1

Possible Duplicate:
C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas
regex to parse csv

I know this question has been asked many times, but the answers all differ and I am confused.

My row is:

1,3.2,BCD,"qwer 47"" ""dfg""",1

The row follows the MS Excel convention: quoting is optional, and a quote inside a quoted field is doubled. (The data qwer 47" "dfg" is represented as "qwer 47"" ""dfg""".)
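To make the convention concrete, here is a minimal Java sketch of how a writer might quote a single field under these rules. The class and method names (`CsvQuoting`, `quoteField`) are made up for illustration, and the choice to quote only fields containing a comma, quote, or line break is an assumption about the convention rather than something taken from a specification.

```java
final class CsvQuoting {
    /**
     * Quotes a field only when it needs it (assumption: when it contains
     * a comma, a double quote, or a line break), doubling embedded quotes.
     */
    static String quoteField(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The field from the row above: qwer 47" "dfg"
        System.out.println(quoteField("qwer 47\" \"dfg\""));
        System.out.println(quoteField("BCD"));
    }
}
```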

I need a regex.

Trupti Swain

3 Answers

7

OK, you've seen from the comments that regex is so not the right tool for this. But if you insist, here goes:

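This regex will work in Java (and in other flavors that support possessive quantifiers and verbose regexes, such as PCRE; .NET has no possessive quantifiers, so there you would need atomic groups such as (?>[^"]*) instead):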

^            # Start of string
(?:          # Match the following:
 (?:         #  Either match
  [^",\n]*+  #   0 or more characters except comma, quote or newline
 |           #  or
  "          #   an opening quote
  (?:        #   followed by either
   [^"]*+    #    0 or more non-quote characters
  |          #   or
   ""        #    an escaped quote ("")
  )*         #   any number of times
  "          #   followed by a closing quote
 )           #  End of alternation
 ,           #  Match a comma (separating the CSV columns)
)*           # Do this zero or more times.
(?:          # Then match
 (?:         #  using the same rules as above
  [^",\n]*+  #  an unquoted CSV field
 |           #  or a quoted CSV field
  "(?:[^"]*+|"")*"
 )           #  End of alternation
)            # End of non-capturing group
$            # End of string

Java code:

boolean foundMatch = subjectString.matches(
    "(?x)^         # Start of string\n" +
    "(?:           # Match the following:\n" +
    " (?:          #  Either match\n" +
    "  [^\",\\n]*+ #   0 or more characters except comma, quote or newline\n" +
    " |            #  or\n" +
    "  \"          #   an opening quote\n" +
    "  (?:         #   followed by either\n" +
    "   [^\"]*+    #    0 or more non-quote characters\n" +
    "  |           #   or\n" +
    "   \"\"       #    an escaped quote (\"\")\n" +
    "  )*          #   any number of times\n" +
    "  \"          #   followed by a closing quote\n" +
    " )            #  End of alternation\n" +
    " ,            #  Match a comma (separating the CSV columns)\n" +
    ")*            # Do this zero or more times.\n" +
    "(?:           # Then match\n" +
    " (?:          #  using the same rules as above\n" +
    "  [^\",\\n]*+ #  an unquoted CSV field\n" +
    " |            #  or a quoted CSV field\n" +
    "  \"(?:[^\"]*+|\"\")*\"\n" +
    " )            #  End of alternation\n" +
    ")             # End of non-capturing group\n" +
    "$             # End of string");

Be aware that you can't assume that every line in a CSV file is a complete row. You can have newlines within a CSV row (as long as the column containing the newlines is enclosed in quotes). This regex knows this, but it will fail if you feed it only a partial row. Which is yet another reason why you really need a CSV parser to validate a CSV file. That's what parsers do. If you control your input and know that you'll never have newlines inside a CSV field, you might get away with it, but only then.
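As a rough illustration of the points above, the following sketch compiles a compact form of the same regex once and tries it on a few rows. The class and method names (`CsvRowRegex`, `isValidRow`) are my own, and the `^`/`$` anchors are dropped because `Matcher.matches()` already requires the whole input to match. The third call shows a quoted field that contains a newline; the fourth is a partial row whose quoted field is never closed.

```java
import java.util.regex.Pattern;

final class CsvRowRegex {
    // One CSV field: either unquoted text without comma, quote or newline,
    // or a quoted string in which "" stands for a literal quote.
    private static final String FIELD = "(?:[^\",\\n]*+|\"(?:[^\"]*+|\"\")*\")";

    // Zero or more "field," groups followed by a final field.
    private static final Pattern ROW = Pattern.compile("(?:" + FIELD + ",)*" + FIELD);

    static boolean isValidRow(String row) {
        return ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidRow("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1"));
        System.out.println(isValidRow("ab\"c"));                 // stray quote in an unquoted field
        System.out.println(isValidRow("a,\"line1\nline2\",b"));  // newline inside quotes
        System.out.println(isValidRow("1,\"abc"));               // partial row
    }
}
```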

Tim Pietzcker
  • how to read the CSV fields with this regex? – Rez.Net Jul 31 '14 at 07:14
  • In my experience, this type of regex may end up in catastrophic backtracking http://www.regular-expressions.info/catastrophic.html and crash your system. This is the case if you are missing the last quote of a quoted string (i.e. csv line is truncated / corrupted) – MrE Sep 22 '17 at 00:46
1

I haven't done Java in a while, so here's some pseudocode to do this. You could use it as a function that accepts a String representing one row of your CSV.

1. Split the row on the "," delimiter into an array of strings. (The method might be called string.split().)
2. Iterate through the array (cells).
    3. If the current string (cell) contains a double quote:
        4. If it doesn't start with a quote, return false; otherwise remove that quote.
        5. If it doesn't end with a quote, return false; otherwise remove that quote.
        6. Iterate through the remaining characters of the string.
            7. If a quote is found, check whether the next character is also a quote. If it is not, return false; if it is, skip it.
        8. End character iteration
    9. End if
10. End array iteration
11. Return true
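A variation on the steps above, sketched in Java: instead of splitting on commas first, it scans the row character by character, so a comma inside a quoted field is not treated as a delimiter. This is not the exact algorithm described above, and the names (`CsvRowScanner`, `parseRow`) are hypothetical. It assumes it is given one complete row, and it returns the unescaped fields so that it can double as a simple reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class CsvRowScanner {
    /**
     * Parses one complete CSV row. Returns the unescaped fields,
     * or Optional.empty() if the quoting is malformed.
     */
    static Optional<List<String>> parseRow(String row) {
        List<String> fields = new ArrayList<>();
        final int n = row.length();
        int i = 0;
        while (true) {
            StringBuilder field = new StringBuilder();
            if (i < n && row.charAt(i) == '"') {
                i++; // skip the opening quote
                boolean closed = false;
                while (i < n) {
                    char c = row.charAt(i);
                    if (c != '"') {
                        field.append(c);
                        i++;
                    } else if (i + 1 < n && row.charAt(i + 1) == '"') {
                        field.append('"'); // "" is an escaped quote
                        i += 2;
                    } else {
                        i++; // closing quote
                        closed = true;
                        break;
                    }
                }
                if (!closed) {
                    return Optional.empty(); // row ended inside a quoted field
                }
            } else {
                // Unquoted field: runs up to the next comma and must not
                // contain a quote or a newline.
                while (i < n && row.charAt(i) != ',') {
                    char c = row.charAt(i);
                    if (c == '"' || c == '\n') {
                        return Optional.empty();
                    }
                    field.append(c);
                    i++;
                }
            }
            fields.add(field.toString());
            if (i == n) {
                return Optional.of(fields); // consumed the whole row
            }
            if (row.charAt(i) != ',') {
                return Optional.empty(); // text directly after a closing quote
            }
            i++; // skip the comma and read the next field
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRow("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1"));
        System.out.println(parseRow("x,\"a,b\",y"));
        System.out.println(parseRow("1,\"abc"));
    }
}
```

Each character is visited once, so the cost is linear in the length of the row, and there is no backtracking to worry about when a closing quote is missing.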
neeKo
  • Hi niko, this is what I was trying to avoid. I want it to be validated in one shot. – Trupti Swain Oct 30 '11 at 20:15
  • 1
    regex is hardly a "one shot" - it is very likely to be more expensive than this code. And if you need a one-liner, use it as a method you would call. Do you have restrictions on code length or similar? – neeKo Oct 30 '11 at 20:19
  • As I understand it, regexes are faster at string parsing. – Trupti Swain Oct 30 '11 at 20:38
  • 1
    A complex regex such as the one you require is pretty much guaranteed to be slower than a specialized coded solution. Regex still has to iterate through the string, and has to iterate through itself and use complex rules to match patterns. In this code we aren't really matching anything; we are looking for a specific character. Regex would be overkill. – neeKo Oct 30 '11 at 20:56
  • This fails where there is a comma within quotes. :-) – Kugel Sep 23 '13 at 01:43
0

I use the regex from this blog article, which addresses the same problem you are trying to solve.

See it here: http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html

In short: ^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
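A quick way to try this pattern from Java is sketched below; the class and method names (`BlogRegexCheck`, `accepts`) are made up for illustration. One thing worth checking before relying on it for validation: as far as I can tell, the unquoted alternative `[^,]*` also matches text that contains quote characters, so the pattern accepts a row with an unclosed quote. It seems better suited to picking apart rows that are already well-formed than to rejecting malformed ones.

```java
import java.util.regex.Pattern;

final class BlogRegexCheck {
    // The pattern quoted above, escaped for a Java string literal.
    static final Pattern ROW = Pattern.compile(
            "^((\"(?:[^\"]|\"\")*\"|[^,]*)(,(\"(?:[^\"]|\"\")*\"|[^,]*))*)$");

    static boolean accepts(String row) {
        return ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        // The well-formed row from the question.
        System.out.println(accepts("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1"));
        // A row whose quoted field is never closed.
        System.out.println(accepts("1,\"abc"));
    }
}
```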

Petr