1

I know this has been discussed a million times. I tried searching through the forums and have seen some close regex expressions and tried to modify them but to no avail.

Say there is a line in a CSV file like this:

"123", 456, "701 "B" Street", 910
                 ^^^

Is there an easy regex to detect "B" (since it's a non-escaped set of quotes within the normal CSV quotes) and replace it with something like \"B\" ? The final string would end up looking like this:

"123", 456, "701 \"B\" Street", 910

Help would be greatly appreciated!

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
user361970
  • 11
  • 2
  • Are 456 and 910 supposed to be quoted? Or do you only have some of the fields in the CSV quoted? – Brent Writes Code Jun 09 '10 at 02:07
  • possible duplicate of [Parsing CSV input with a RegEx in java](http://stackoverflow.com/questions/1441556/parsing-csv-input-with-a-regex-in-java) –  Jun 09 '10 at 02:07
  • Some fields are quoted and some are not unfortunately – user361970 Jun 09 '10 at 02:09
  • Something like this seems close but not quite what I'm looking for: http://stackoverflow.com/questions/1601780/regular-expression-to-find-unescaped-double-quotes-in-csv-file – user361970 Jun 09 '10 at 02:11

5 Answers5

4

Trust me you don't want to do this with regex. You want something like Java CSV Library.

  • Yes, I agree. Unfortunately, I'm a lowly developer using a StreamTokenizer based solution that I unfortunately cant just scrap. It would work fine if those inner quotes were escaped however. – user361970 Jun 09 '10 at 02:26
  • 1
    @user361970 - if you have a broken solution that you need to fix, *of course* you can scrap it and do it better. Surely, we cannot be talking about more than 100 lines of code here. If your boss says otherwise, send him to SO so that we can explain to him why it is a bad idea to patch bad code. – Stephen C Jun 09 '10 at 04:16
  • StreamTokenizer is even WORSE –  Jun 09 '10 at 13:12
1
(?<!^)(?<!",)(?<!\d,)"(?!,")(?!,\d)(?!$)(?!,-\d)

I got this to work, thought I would post it if anyone else is looking for an answer

kenf
  • 28
  • 7
1

There are a few zillion libraries to help you parse CSV, but if you're wanting to use a regexp for academic reasons, this may help:

  • quoted string with escape support. "(\\.|[^\\"])*"
  • unquoted field: [^",]*
  • delimiter: , *

I don't use CSV files, so I'm not sure about the 'other csv field' validity (matching 456, for example above), or whether /, */ is the delimiter you want..

At any rate, combining the above will match one field and one delimiter (or end of string):

(quotedstring|unquoted)(delimiter|$)
0

I would use a tailored sed expression as

's/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \4 \5 \6/g'
ring bearer
  • 20,383
  • 7
  • 59
  • 72
  • This might be the way to go in the interm – user361970 Jun 09 '10 at 03:11
  • how would I modify this for escaping with \ instead of replacing with an empty string? – user361970 Jun 09 '10 at 03:27
  • Simple `'s/\(.*\),\(.*\),\(.*\)"\(.*\)\" \(.*\),\(.*\)/\1,\2,\3 \\\"\4\\" \5 \6/g'` Note that \\ will cause to print \ and " will print a " around \4 Hope that answers it. – ring bearer Jun 09 '10 at 03:57
  • I guess I need to take some sed lessons. I get this testing it in cygwin sed: -e expression #1, char 58: invalid reference \6 on `s' command's RHS – user361970 Jun 09 '10 at 14:58
  • Formatting has messed up above sed expression. Look at my original answer, keep the regex part as is.. just change \4 as \\\"\4\\" – ring bearer Jun 09 '10 at 15:48
0

Your example is not proper CSV:

"123", 456, "701 "B" Street", 910

this should actually be:

"123", 456, "701 ""B"" Street", 910

(There are plenty of variations of CSV, of course, but since most of the time people want it for use with excel or access I stick to the Microsoft definition.)

Therefore the regex for this can look like:

".+("").+("").+"

The groups (in parentheses) will be your double quotes, and the rest ensures that they are found within another set of quotes.

That covers the find part of your needs. The replace part depends on what you are programming in.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • Not exactly. In the CSV case, you're looking for a pattern like `([^"]|"")*` : matches tokens made from non-quotes or two quotes. `.+` might match single quotes anyway, and `.+("").+("").+` assumes a too-specific format - it only allows two quotes, and `+` requires characters before, between and after them. – Kobi Jun 09 '10 at 04:32
  • Ahh...right you are. That's what I get for answering so close after dinner. I always get my plusses and asterisks confused... – Richard Merren Jun 09 '10 at 06:20