2

The question is pretty simple.


A CSV file looks like this:

1, "John", "John Joy"

If I want to get each column, I just use String[] splits = line.split(",");


What if the CSV file looks like this:

1, "John", "Joy, John"

So we have a comma inside a pair of double quotes. The split above no longer works, because I want "Joy, John" kept as a single field.
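To make the problem concrete, here is a minimal sketch of what the plain split does with that line. The class and method names are only for illustration.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Splits on every comma, with no awareness of quotes.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String line = "1, \"John\", \"Joy, John\"";
        String[] splits = naiveSplit(line);
        // The quoted field is cut at its inner comma, so there are 4 pieces instead of 3.
        System.out.println(splits.length);
        System.out.println(Arrays.toString(splits));
    }
}
```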


So is there an elegant / simple algorithm to deal with this situation?


Edit:

Please do not treat this as a formal CSV parsing question. I am only using CSV as an example of where I need to split.

What I really want is NOT a proper CSV parser; I just want an algorithm that can properly split a line on commas while respecting double quotes.

Jackson Tale
  • 1
    *"A CSV file looks like this:"* A *simple* one does. A complicated one can have line breaks within the quoted field. So if you're reading it line-by-line (as you appear to be), beware that unless your line-aware code is handling the line-breaks-within-quotes thing, your `line` variable can contain only *part* of a record. – T.J. Crowder Nov 26 '12 at 11:35
  • 3
    You need a CSV parser, a simple State Machine will do. – neevek Nov 26 '12 at 11:37
  • If by elegant you mean something like a single regular expression: there isn't. CSVs are far more complex than meets the eye: multi-line fields, escaped quotes and so on. There are however CSV parser libraries you can use: OpenCSV for example, but there definitely is an Apache one too. – biziclop Nov 26 '12 at 11:41
  • How about just to deal with the above two cases? – Jackson Tale Nov 26 '12 at 15:27

5 Answers

4

It's better to use an existing library for this purpose instead of writing a custom implementation (unless you are doing this to learn). CSV has some quirks that are easy to miss in a custom implementation, and a library is usually well tested.

You can find some good ones here: Can you recommend a Java library for reading (and possibly writing) CSV files?

EDIT

I've written a method that will parse your string, but it may not work perfectly because I haven't tested it thoroughly. Treat it as a starting point that you can improve further.

    // Requires java.util.ArrayList and java.util.List.
    String inputString = "1, \"John\",\"Joy, John\"";
    char quote = '"';
    List<String> csvList = new ArrayList<String>();
    boolean inQuote = false;
    int lastStart = 0;
    for (int i = 0; i < inputString.length(); i++) {
        char c = inputString.charAt(i);
        if (c == quote) {
            // a quote toggles the "inside a quoted field" state
            inQuote = !inQuote;
        } else if (c == ',' && !inQuote) {
            // a comma outside quotes ends the current field
            csvList.add(inputString.substring(lastStart, i));
            lastStart = i + 1;
        }
    }
    // whatever follows the last separator is the final field
    // (adding it here avoids a duplicate entry when the line ends with a comma)
    csvList.add(inputString.substring(lastStart));
    // fields keep their surrounding quotes and any leading spaces
    System.out.println(csvList);

Question for you

What if you get a string like 1, "John", ""Joy, John"" (two quotes around "Joy, John")?
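One possible way to think about quotes inside a field is the common CSV convention that a literal quote inside a quoted field is written doubled, as in "say ""hi"", John". The sketch below assumes that convention and also assumes that whitespace around fields can be trimmed; QuoteAwareSplitter and its split method are hypothetical names, not part of any library. Under this convention the string above, with only two quotes on each side, would still be split at its inner comma; it would need three quotes on each side to stay together.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    // Hypothetical helper: splits on commas outside quotes and
    // treats "" inside a quoted field as one literal quote (assumed convention).
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuote && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    // doubled quote inside a quoted field: keep one quote, skip the other
                    current.append('"');
                    i++;
                } else {
                    // opening or closing quote: toggle state, do not copy the quote
                    inQuote = !inQuote;
                }
            } else if (c == ',' && !inQuote) {
                // separator outside quotes ends the current field
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // the last field has no trailing comma
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1, \"John\", \"Joy, John\""));
        System.out.println(split("1, \"say \"\"hi\"\", John\""));
    }
}
```

Building each field in a StringBuilder, rather than taking substrings, is what makes it possible to drop the enclosing quotes and collapse the doubled ones in a single pass.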

Community
nkukhar
1
// Use a regexp with Matcher (requires java.util.regex.Matcher and java.util.regex.Pattern).
// Note: the pattern only finds quoted fields; an unquoted field such as 1 would be skipped.

String string1 = "\"John\", \"John Joy\"";
String string2 = "\"John\", \"Joy, John\"";
Pattern pattern = Pattern.compile("\"[^\"]+\"");

Matcher matcher = pattern.matcher(string1);
System.out.println("string1: " + string1);
while (matcher.find()) {
    // each match is one quoted field, quotes included
    System.out.println(matcher.group());
}

matcher = pattern.matcher(string2);
System.out.println("string2: " + string2);
while (matcher.find()) {
    System.out.println(matcher.group());
}
akjoshi
Alex
  • Alex: It will be great if you explain your answer a bit like why you think it's elegant or the approach it takes etc. – akjoshi Nov 27 '12 at 06:25
0

Using regular expressions is quite elegant.
Sorry, I'm not familiar with Java regex, so my example is in Lua.
(This example doesn't account for newline characters inside quoted text, or for quote characters being doubled inside quoted text.)

--- file.csv
1, "John", "John Joy"
2, "John", "Joy, John"

--- Lua code
for line in io.lines 'file.csv' do
   print '==='
   for _, s in (line..','):gmatch '%s*("?)(.-)%1%s*,' do
      print(s)
   end
end

--- Output
===
1
John
John Joy
===
2
John
Joy, John
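
For readers working in Java, the same pattern can probably be carried over to java.util.regex. The following is only a sketch with made-up class and method names: %s becomes \s, the lazy .- becomes .*?, and %1 becomes the backreference \1. It shares the limitations noted above (no newlines or doubled quotes inside quoted text).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LuaStyleSplit {
    // optional opening quote, lazily matched content, the same quote again, then a comma
    private static final Pattern FIELD = Pattern.compile("\\s*(\"?)(.*?)\\1\\s*,");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        // a trailing comma is appended so the last field is terminated like the others
        Matcher m = FIELD.matcher(line + ",");
        while (m.find()) {
            // group 2 is the field content without the enclosing quotes
            fields.add(m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1, \"John\", \"John Joy\""));
        System.out.println(split("2, \"John\", \"Joy, John\""));
    }
}
```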
Egor Skriptunoff
0

You could start with the regular expression:

[^",]*|"[^"]*"

which matches either a non-quoted string not containing a comma or a quoted string. However, there are lots of questions, including:

  1. Do you really have spaces after the commas in your input? Or, more generally, will you allow quotes which are not exactly at the first character of a field?

  2. How do you put quotes around a field which includes a quote?

Depending on how you answer those questions, you might end up with different regular expressions. (Indeed, the customary advice to use a CSV parsing library is not so much about handling the corner cases; it is about not having to think about them because you assume "standard CSV" handling, whatever that might be according to the author of the parsing library. CSV is a mess.)

One regular expression I've used with some success (although it is not CSV compatible) is:

(?:[^",]|"[^"]*")*

which is pretty similar to the first one, except that it allows any number of concatenated quoted and unquoted pieces within a field, so that each of the following is recognized as a single field:

"John"", Mary"
John", "Mary

CSV standard would treat the first one as representing:

John", Mary    -- internal quote

and treat the quotes in the second one as ordinary characters, resulting in two fields. So YMMV.

In any event, once you decide on an appropriate regex, the algorithm is simple. Here it is in pseudo-code, since I'm far from a Java expert:

repeat:
   match the regex at the current position
     and append the matched text to the result list;
   if the match fails:
     report error
   if the match goes to the end of the string:
     done
   if the next character is a ',':
     advance the position by one
   otherwise:
     report error

Depending on the regex, the two conditions under which you report an error might not be possible. Generally, the first one will trigger if the quoted field is not terminated (and you need to decide whether to allow new-lines in the quoted field -- CSV does). The second one might happen if you used the first regex I provided and then didn't immediately follow the quoted string with a comma.
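
For what it's worth, the loop above might look roughly like this in Java. It is a sketch using the second regex, with hypothetical class and method names; fields are returned exactly as matched, so quotes and leading spaces are kept. Because this regex can match the empty string, the first error branch never fires here, and an unterminated quote surfaces through the second one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldSplitter {
    // any run of non-quote, non-comma characters and complete quoted strings
    private static final Pattern FIELD = Pattern.compile("(?:[^\",]|\"[^\"]*\")*");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            // match the regex at the current position only
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("no field at position " + pos);
            }
            fields.add(m.group());
            pos = m.end();
            if (pos == line.length()) {
                return fields; // the match reached the end of the string
            }
            if (line.charAt(pos) != ',') {
                throw new IllegalArgumentException("expected ',' at position " + pos);
            }
            pos++; // step over the comma
        }
    }

    public static void main(String[] args) {
        System.out.println(split("1, \"John\", \"Joy, John\""));
    }
}
```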

rici
-1

First split the string on quotes. Odd segments (counting from zero) will have quoted content; even ones will have to be split one more time on commas. I use it on logs, where quoted text doesn't have escaped quotes, just like in this question.

    boolean quoted = false;
    for(String q : str.split("\"")) {
        if(quoted)
            System.out.println(q.trim());
        else
            for(String s : q.split(","))
                if(!s.trim().isEmpty())
                    System.out.println(s.trim());
        quoted = !quoted;
    }
Grk58
  • 1
    Explain something about your answer. – Kumar May 15 '15 at 07:31
  • just don't like regex. the code above should split the str as requested. – Grk58 May 15 '15 at 07:36
  • I think that what @Kumar is getting at is that code-only answers are usually not acceptable (e.g. [this meta.stackexchange question](http://meta.stackexchange.com/q/148272)). – Wai Ha Lee May 15 '15 at 07:51