2

I am reading a file and splitting each line into fields on the ',' separator. However, some fields contain a ',' themselves, and those fields are enclosed in double quotes. How can I split the line while ignoring the separators inside quotes? Here is what I have done:

String[] cols = line.split(Pattern.quote(","));

How should I modify this using only split() in Java? Also, what changes will I have to make if the separator is a pipe '|'?

BenMorel
justin3250

3 Answers

5

I answered a similar question here. The first expression, modified for your task, would read

,(?=([^"]*"[^"]*")*[^"]*$)

This expression identifies an unquoted comma by ensuring that an even number of quotation marks follows it.
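As a minimal sketch of how this expression could be used with split(), assuming the file is processed one line at a time: the class name, method name, sample line, and the `-1` limit (which keeps trailing empty fields) are illustrative choices, not part of the original answer. The double quotes inside the regex are escaped for the Java string literal, and the pattern is passed to split() directly rather than through Pattern.quote(), which would turn the whole expression into a literal.

```java
import java.util.Arrays;

class QuotedCommaSplit {
    // The regex from the answer; each " is escaped as \" for the Java string literal.
    static final String UNQUOTED_COMMA = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Splits one line on commas that are followed by an even number of quotes.
    // The limit of -1 is an assumption made here so trailing empty fields are kept.
    static String[] splitLine(String line) {
        return line.split(UNQUOTED_COMMA, -1);
    }

    public static void main(String[] args) {
        String line = "1,\"Smith, John\",London";
        // The enclosing quotes stay part of the field; only the split points change.
        System.out.println(Arrays.toString(splitLine(line)));
    }
}
```

Note that the quoted field keeps its surrounding double quotes, so they would have to be stripped in a separate step if they are not wanted.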

Community
Jens
  • +1 for the regex. I just want to mention that at each comma encountered, the regex engine has to scan through the rest of the file to figure out whether an even number of quotation marks lies ahead, and this might be a huge overhead if the file is large. – Narendra Yadala Nov 11 '11 at 07:25
  • Hi, I tried your regex in my code, String[] cols = line.split(Pattern.quote(",(?=([^"]*"[^"]*")*[^"]*$) ")); and it shows me: The operator * is undefined for the argument type(s) java.lang.String, java.lang.String. I'm pretty new to regex, so I can't understand what this means. – justin3250 Nov 11 '11 at 07:32
  • 1
    @justin3250: To represent the regex as a java string, you'd need to escape the quotes, i.e. `",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"` – Jens Nov 11 '11 at 08:31
  • @Narendra: I suppose that's not that bad if you handle every line separately, unless the lines are very long. Good point though! – Jens Nov 11 '11 at 08:33
  • Wow, thank you so much, this really helped me a lot. However, does the (",) at the start of the regex symbolise the separator to split on? If I have a pipe-delimited file and I change the regex to line.split("|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"); then it gives me a very bad split and I'm not able to understand how it is splitting. – justin3250 Nov 11 '11 at 09:09
  • 1
    @justin3250: That's because "|" has special meaning in regular expressions, and therefore needs to be escaped by using "\|". And the backslash needs to be escaped for the Java string, resulting in `"\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"` – Jens Nov 11 '11 at 09:27
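The pipe case from the comments can be sketched in a delimiter-independent way. An unescaped | is the alternation operator, so "|(?=...)" can match the empty string at every position, which is why the split looked so bad. The sketch below assumes a hypothetical helper that applies Pattern.quote() to the delimiter only and then appends the lookahead unchanged; the class and method names, and the `-1` limit, are illustrative and not from the thread.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedDelimiterSplit {
    // Lookahead from the answer: an even number of quotes follows the current position.
    static final String OUTSIDE_QUOTES = "(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: quote only the delimiter, so metacharacters such as |
    // are treated literally, while the lookahead remains a real regex.
    static String[] split(String line, String delimiter) {
        String regex = Pattern.quote(delimiter) + OUTSIDE_QUOTES;
        return line.split(regex, -1); // -1 keeps trailing empty fields (an assumption)
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("a|\"b|c\"|d", "|")));
        System.out.println(Arrays.toString(split("a,\"b,c\",d", ",")));
    }
}
```

Writing the delimiter as "\\|" by hand, as in the comment above, is equivalent for the pipe; quoting it programmatically just avoids having to know which characters are special.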
3

I wouldn't try using a regex for this. Regular expressions are just not a great match for this - while it may be possible to create such a regex, it would be horrible to read.

There are plenty of open source CSV parsers. Just a quick search found many projects - I would look through those before writing your own.
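To illustrate what a non-regex approach looks like, here is a rough sketch of a character-by-character scanner for a single line. It is not any particular library's implementation, and a tested open source parser remains the better choice. The class name is hypothetical, and the handling of a doubled quote ("") as an escaped quote inside a quoted field is an assumption borrowed from common CSV conventions, not something stated in the question. Fields spanning several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleCsvLine {
    // Splits one line on the delimiter, ignoring delimiters inside double quotes.
    // Enclosing quotes are dropped; "" inside a quoted field becomes a single ".
    static List<String> parse(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote: keep one and skip the other
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == delimiter && !inQuotes) {
                fields.add(field.toString()); // unquoted delimiter ends the field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,\"Smith, John\",London", ','));
        System.out.println(parse("a|\"b|c\"|d", '|'));
    }
}
```

Each line is scanned once from left to right, so the cost does not depend on how many delimiters the line contains.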

Jon Skeet
0
    String line = "one|two,three";
    // Inside a character class, | is literal and needs no escaping.
    String[] cols = line.split("[,|]");

Something like the above would split on both , and |.

Outside a character class, the metacharacter | has to be escaped as \\| in a Java string; inside [...] it is already literal. I agree with the others: it's better to use one of the CSV parsers out there than to reinvent one.

questborn