2

I am looking for a java regex which will escape the doublequote within an excel cell.

I have followed this example but need another change in the regular expression to make it work for escaping doublequote within one of the cells.

Parsing CSV input with a RegEx in java

private final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

Example Data:

"A,B","2" size","text1,text2, text3"

The regex from above fails at 2".

I want the output to be as below .Doesn't matter if the outer double quotes are there or not.

"A,B"
"2" size"
"text1,text2, text3"

Community
  • 1
  • 1
user1864665
  • 51
  • 1
  • 4
  • Well, what do you want the output to be? – Martin Ender Nov 29 '12 at 22:47
  • Why on earth would you parse csv with RegEx? I could see using RegEx on individual tokens once you've gotten those through conventional means but parsing the whole csv with RegEx is just a terrible idea. If I were conducting a code review on this I would tell you to rewrite the entire thing. – evanmcdonnal Nov 29 '12 at 22:55
  • 2
    Why are you doing it with regex, instead of something like [OpenCSV](http://opencsv.sourceforge.net) ? – jlordo Nov 29 '12 at 22:55
  • @jlordo I don't have an option of using third party libraries. This regex was working fine until we came across data that may have a double quote within a cell data – user1864665 Nov 29 '12 at 23:03
  • Is _no third party libraries_ part of the homework or corporate requirement? – jlordo Nov 29 '12 at 23:05
  • Do you wanna do this with Regex or are you accepting better sollutions? – Ricardo Souza Nov 29 '12 at 23:08
  • 3
    Your input data is ambiguous, it is impossible to design either a regex or CSV solution that can properly handle all cases if you use double quotes as a delimiter, and also can include unescaped double quotes within fields. For example `"foo","bar"` could be the fields `foo` and `bar`, or it could be one field with the contents `foo","bar`. – Andrew Clark Nov 29 '12 at 23:11
  • @F.J exactly what I want to say. there could be comma, double quote in content. so I could consider that the whole string (in OPs example) is just one field. – Kent Nov 29 '12 at 23:15
  • @rcdmk I can accept better solutions too – user1864665 Nov 29 '12 at 23:17
  • @F.J Is there any other solution that you can advise other than regex? The second field is a size field and the inch representation is what is causing this kind of data. 2"=2 inches – user1864665 Nov 29 '12 at 23:19
  • 3
    @user1864665 You need to work on whatever is generating that data, so that any double quotes within fields are escaped. Your data should really look something like this: `"A,B","2\" size","text1,text2, text3"`. If you can get it to that point, then CSV is your best bet and regex would be possible (but complicated). – Andrew Clark Nov 29 '12 at 23:21

2 Answers2

0

while I agree, that using regex for parsing a CVS is not really the best way, a slightly better pattern is:

Pattern pattern = Pattern.compile("^\"([^\"]*)\",|,\"([^\"]*)\",|,\"([^\"]*)\"$|(?<=,|^)([^,]*)(?=,|$)");

This will terminate a cell value only after quote and comma, or start it after a command and a quote.

eppesuig
  • 1,375
  • 2
  • 12
  • 28
  • Thanks for the regex. Actually when the csv is opened in excel and saved back with modified data, double quotes surrounding a cell are gone. It preserves the double quotes if the cell has a comma within. So, in that case, my data is becoming, "A,B",2" size,"text1,text2, text3",text4,text5,"text6,text7" (I added three more columns to better understand). – user1864665 Dec 07 '12 at 21:56
  • What about double quote inside the text? Are these escaped in any way? – eppesuig Dec 08 '12 at 05:59
  • No, a double quote within a cell is not escaped. :( – user1864665 Dec 10 '12 at 21:38
  • I can have the double quote escaped with another double quote. For example, 2" will be 2"". How should I change the regex from my original post to ignore the escaped double quotes within a cell? – user1864665 Jan 07 '13 at 23:07
  • Your pattern may match a certain string, but cannot change it. So, the double quote removing should be done after matching. Please accept this whole answer if you like it. – eppesuig Jan 07 '13 at 23:57
0

well as F.J commented, the input data is ambiguous. But for your example input, you could try

  • string.split("\",\"") method to get a String[]. after this, you got an array with 3 elements:
[
"A,B, 
2" size, 
text1,text2, text3"
]
  • remove the first character (which is double quote) of the first element of the array
  • remove the last character (which is double quote) of the last element of the array
Kent
  • 189,393
  • 32
  • 233
  • 301