
I have a CSV splitter that uses the following regex to split a string on commas.

String[] splitData = splitCSV.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

It works so far for strings like 123, "foo", "bar", "no, split, here", but when it encounters an inch sign (") as in the following input, it fails to split.

"123, 1.0" xyz"

I need it to split into 123 and 1.0" xyz

Hope someone can provide a solution for this. Thank you.

SajithRu
  • Could you provide some data ? – Rahul Apr 21 '17 at 08:51
  • @Sajirupee: Probably because the inch sign delimits strings. I'd use `'` for inches instead. I'd also like to know whether the program compiles, and please show the output you are getting. – Luatic Apr 21 '17 at 08:52
  • You didn't show us expected output of first input string. – revo Apr 21 '17 at 08:57
  • @revo first string split into `123` `foo` `bar` and `no, split, here` – SajithRu Apr 21 '17 at 09:00
  • @user7185318 The program does compile. After the split, the output looks like `123, 1.0" xyz`, which is inserted into the `ArrayList` as a single element, but I'm expecting 2. – SajithRu Apr 21 '17 at 09:05
  • Since Java has no **raw-string** you can [Try this instead of your string](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#quote%28java.lang.String%29) – Shakiba Moshiri Apr 21 '17 at 09:14

2 Answers


A couple of points here:

  1. You should be using an existing CSV processing library, not creating your own with a regex. There are many available for Java, see this question as a starting point. This is a solved problem; there's no reason to reinvent it.
  2. The scenario you mention would be invalid* data. A quote should be escaped within a string, usually by using two quotes together. Having one unescaped quote makes the file invalid; and furthermore there is usually no reliable way to tell what the file "should" be once you have these sorts of errors. What to do about it:

    • If the file is within your control, correct it. Use a standard escape format for quotes within a string.
    • If the file is not within your control, you should handle errors separately rather than including this in your core processing. Either preprocess the file looking for errors, or use error handling available in a CSV library to do something with the lines that come back as having an incorrect format. If the errors are limited to a predictable issue that you know ahead of time, you might be able to correct them. But in most cases errors like this lead you to have to reject the lines.
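To illustrate why the escaped form matters, here is a minimal sketch, assuming the file can be corrected so that the inch sign is written as two quotes inside a quoted item (`123,"1.0"" xyz"`). The class name `DoubledQuoteDemo` and its `split` method are made up for this example, and the regex is the commonly used "comma followed by an even number of quotes" lookahead. A CSV library is still the better choice; this only shows that the doubled-quote form keeps the quote count even, so the line becomes splittable again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class DoubledQuoteDemo {

    // Matches a comma only when an even number of quotes follows it,
    // i.e. when the comma sits outside any quoted item.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        List<String> items = new ArrayList<>();
        // Limit -1 keeps trailing empty items.
        for (String raw : line.split(COMMA_OUTSIDE_QUOTES, -1)) {
            String item = raw.trim();
            // Strip the surrounding quotes and collapse "" back into ".
            if (item.length() >= 2 && item.startsWith("\"") && item.endsWith("\"")) {
                item = item.substring(1, item.length() - 1).replace("\"\"", "\"");
            }
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(split("123,\"1.0\"\" xyz\""));
    }
}
```

With the unescaped input from the question the quote count is odd, so the lookahead never matches and the line comes back as a single item; that is the kind of line you would route to error handling instead.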

*Technically there is no CSV standard, so anything goes. But this would be a data error in any reasonable format. And in the real world this almost always occurs because someone didn't think the file format through, not because they intentionally planned it this way.

Community
  • Thank you for the help. I ended up using Apache common csv library. It resolved my issue without any trouble. – SajithRu May 02 '17 at 03:46

What you have here is an unusual dialect of CSV.

Although there is no formalised standard for CSV, there are broadly two approaches to quotes:

  1. Quotes are not special. That is: 7" single, 12" album is two items: 7" single and 12" album. In this dialect, items containing , are problematic.
  2. Quotes are special. That is: "you, me","me you" is two items: you, me and me you. In this dialect, you can put quotes around an item in order to have a , within it. However, it makes items containing " problematic, as you have found.

The typical answer to the " problem in the second approach is to escape quotes. So the item 7" single would appear in the CSV as "7\" single". This of course means that \ becomes a problem, but that's easily solved the same way: AC\DC 7" single appears in the CSV as "AC\\DC 7\" single".

If you can adopt one of these conventional approaches, then do so. Then you can either use an existing CSV library, or roll your own. Although a regex can consume these formats, my opinion is that it's not the clearest way to write code to consume CSV: I've found that a more explicit state machine (e.g. a switch (state) statement) is nice and clear.
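As a rough sketch of what I mean by an explicit state machine, assuming the second dialect with backslash escapes as described above: the class name `BackslashCsvParser`, the `State` enum and `parseLine` are invented for this example. It handles a single line, does not trim whitespace around items, and treats a backslash as an escape only inside quotes; those are choices of this sketch, not rules of any standard.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses one CSV line where quotes are special
// and a backslash escapes the next character inside a quoted item.
final class BackslashCsvParser {

    private enum State { UNQUOTED, QUOTED, ESCAPED }

    static List<String> parseLine(String line) {
        List<String> items = new ArrayList<>();
        StringBuilder item = new StringBuilder();
        State state = State.UNQUOTED;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (c == ',') {            // item separator
                        items.add(item.toString());
                        item.setLength(0);
                    } else if (c == '"') {     // opening quote, not part of the item
                        state = State.QUOTED;
                    } else {
                        item.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '\\') {           // next character is taken literally
                        state = State.ESCAPED;
                    } else if (c == '"') {     // closing quote
                        state = State.UNQUOTED;
                    } else {
                        item.append(c);        // includes commas inside quotes
                    }
                    break;
                case ESCAPED:
                    item.append(c);
                    state = State.QUOTED;
                    break;
            }
        }
        if (state != State.UNQUOTED) {
            throw new IllegalArgumentException("Unterminated quoted item: " + line);
        }
        items.add(item.toString());            // last item has no trailing comma
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"AC\\\\DC 7\\\" single\",\"you, me\""));
    }
}
```

Each `case` answers exactly one question ("what does this character mean in this state?"), which is why I find it easier to read and extend than an equivalent regex.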

If you can't change your input format, the puzzle you have to solve is, when you encounter a ", is it a metacharacter (part of a pair of quotes surrounding an item) or is it a real character that's part of the item?

As owner of the format, it's up to you to decide what the rule is. Perhaps a " should only be considered a metacharacter if it's next to a ,. But even that causes problems if you allow a mixture of quoted and unquoted items:

 "A Town Called Malice", The Jam, 7", £6.99

So, you must come up with your own rules, that work in your domain, and write explicit code to handle that situation. One approach is to pre-process the input into canonical CSV so that it's again suitable for a conventional CSV parser.
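Here is one possible pre-processing sketch under a rule set that I am assuming purely for illustration: a " opens a quoted item only if it is the first non-space character of the item; inside a quoted item, a " closes it only if it is followed by optional spaces and then a , or the end of the line; every other " is a literal character. Unquoted items are trimmed. The names `LenientCsvNormalizer`, `splitLenient` and `toCanonical` are hypothetical, and your domain may well need different rules (an input such as "7", 12" stays ambiguous under this one).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processor: reads a lenient line and rewrites it as
// conventional CSV (items quoted when needed, inner quotes doubled).
final class LenientCsvNormalizer {

    static List<String> splitLenient(String line) {
        List<String> items = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = skipSpaces(line, pos);
            int end = -1; // index of the comma ending this item, or line.length()
            if (start < line.length() && line.charAt(start) == '"') {
                StringBuilder item = new StringBuilder();
                for (int i = start + 1; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c == '"') {
                        int after = skipSpaces(line, i + 1);
                        if (after == line.length() || line.charAt(after) == ',') {
                            end = after;   // this quote closes the item
                            break;
                        }
                    }
                    item.append(c);        // literal character, including a stray "
                }
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated quoted item: " + line);
                }
                items.add(item.toString());
            } else {
                end = line.indexOf(',', start);
                if (end < 0) end = line.length();
                items.add(line.substring(start, end).trim());
            }
            if (end == line.length()) return items;
            pos = end + 1;
        }
    }

    static String toCanonical(List<String> items) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) out.append(',');
            String item = items.get(i);
            boolean needsQuotes = item.contains(",") || item.contains("\"")
                    || !item.equals(item.trim());
            out.append(needsQuotes ? "\"" + item.replace("\"", "\"\"") + "\"" : item);
        }
        return out.toString();
    }

    private static int skipSpaces(String s, int i) {
        while (i < s.length() && s.charAt(i) == ' ') i++;
        return i;
    }

    public static void main(String[] args) {
        String line = "\"A Town Called Malice\", The Jam, 7\", £6.99";
        System.out.println(toCanonical(splitLenient(line)));
    }
}
```

The output of `toCanonical` can then be handed to a conventional CSV parser, which keeps the domain-specific guessing in one small, testable place.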

slim