7

I am using Commons CSV to parse CSV content relating to TV shows. One of the shows has a show name which includes double quotes;

116,6,2,29 Sep 10,""JJ" (60 min)","http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"

The showname is "JJ" (60 min) which is already in double quotes. This is throwing an IOException java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

    ArrayList<String> allElements = new ArrayList<String>();
    CSVFormat csvFormat = CSVFormat.DEFAULT;
    CSVParser csvFileParser = new CSVParser(new StringReader(line), csvFormat);

    List<CSVRecord> csvRecords = null;

    csvRecords = csvFileParser.getRecords();

    for (CSVRecord record : csvRecords) {
        int length = record.size();
        for (int x = 0; x < length; x++) {
            allElements.add(record.get(x));
        }
    }

    csvFileParser.close();
    return allElements;

CSVFormat.DEFAULT already sets withQuote('"')

I think that this CSV is not properly formatted as ""JJ" (60 min)" should be """JJ"" (60 min)" - but is there a way to get commons CSV to handle this or do I need to fix this entry manually?

Additional information: Other show names contain spaces and commas within the CSV entry and are placed within double quotes.

mhollander38
  • 775
  • 3
  • 11
  • 22

5 Answers5

7

The problem here is that the quotes are not properly escaped. Your parser doesn't handle that. Try univocity-parsers as this is the only parser for java I know that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. Try this code:

//configure the parser to handle your situation
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(STOP_AT_CLOSING_QUOTE);

//create the parser
CsvParser parser = new CsvParser(settings);

//parse your line
String[] out = parser.parseLine("116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"");

for(String e : out){
    System.out.println(e);
}

This will print:

116
6
2
29 Sep 10
"JJ" (60 min)
http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj

Hope it helps.

Disclosure: I'm the author of this library, it's open source and free (Apache 2.0 license)

Jeronimo Backes
  • 6,141
  • 2
  • 25
  • 29
2

Quoting mainly allows for field to contain separator characters. If embedded quotes in a field are not escaped, this can't work, so there isn't any point in using quotes. If your example value was "JJ", 60 Min, how is a parser to know the comma is part of the field? The data format can't handle embedded commas reliably, so if you want to be able to do that, best to change the source to generate an RFC compliant csv format.

Otherwise, it looks like the data source is simply surrounding non-numeric fields with quotes, and separating each field a comma, so the parser needs to do the reverse. You should probably just treat the data as comma-delimited and strip the leading/trailing quotes yourself with removeStart/removeEnd.

You might use CSVFormat .withQuote(null), or forget about that and just use String .split(',')

Mic
  • 331
  • 1
  • 4
  • Setting withQuote(null) does get round the IO Exception, however, it then passes the quotes through which isn't ideal it also means that shows like "I Love You, Tommy Brown (60 min)" are not coming through as "I love you and losing the rest of the name. I wasn't clear in my question that many show names contain commas and are put within quotes by the source. – mhollander38 Jun 22 '17 at 12:40
  • In that case, is there any way you can adjust the input format? If you want to continue using comma as a delimiter rather than something that's not in the data (such as tab), but you don't want to produce rfc-compliant csv, how is the parser going to know how to parse a show named: "JJ", 60 Min – Mic Jun 22 '17 at 22:30
  • @mic univocity-parsers is the only library I know of that can parse this sort of input, see my answer. – Jeronimo Backes Jun 23 '17 at 05:05
2

You can use withEscape('\\') to ignore quotes within quotes

CSVFormat csvFormat = CSVFormat.DEFAULT.withEscape('\\')

Reference: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html

zephyr
  • 21
  • 3
1

I think that having both quotations AND spaces in the same token is what confuses the parser. Try this:

CSVFormat csvFormat = CSVFormat.DEFAULT.withQuote('"').withQuote(' ');

That should fix it.


Example

For your input line:

String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";

Output is (and no exception is thrown):

[116, 6, 2, 29 Sep 10, ""JJ" (60 min)", "http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"]
SHG
  • 2,516
  • 1
  • 14
  • 20
  • withQuote('"') is already set by DEFAULT. Unfortunately the space is a legitimate entry and contained within various other rows in the CSV in such a way I can't treat it as a quote character. – mhollander38 Jun 22 '17 at 12:34
  • @mhollander38 Space is still legitimate with my format. I'll add an example and output. – SHG Jun 22 '17 at 16:36
  • How are the two `withQuote()` calls supposed to work? Doesn't the latter override the first? – Bernhard Döbler Jun 23 '20 at 17:18
1

No need of special parsers: just add a double quote in front the double quote:

116,6,2,29 Sep 10,"""JJ"" (60 min)",...

It's all specified in RFC 4180

   7.  If double-quotes are used to enclose fields, then a double-quote
   appearing inside a field must be escaped by preceding it with
   another double quote.  For example:

   "aaa","b""bb","ccc"

This is already implemented by CSVFormat #DEFAULT.

Amit Verma
  • 8,660
  • 8
  • 35
  • 40
ntucci
  • 63
  • 1
  • 8