Here is the code snippet I am using:

    StringWriter writer = new StringWriter();
    CSVWriter csvwriter = new CSVWriter(writer);
    String[] originalValues = new String[2];
    originalValues[0] = "t\\est";
    originalValues[1] = "t\\est";
    System.out.println("Original values: " + originalValues[0] +"," + originalValues[1]);
    csvwriter.writeNext(originalValues);

    csvwriter.close();
    CSVReader csvReader = new CSVReader(new StringReader(writer.toString()));
    String[] resultingValues = csvReader.readNext();
    System.out.println("Resulting values: " + resultingValues[0] +"," + resultingValues[1]);

The output of the above snippet is:

Original values: t\est,t\est
Resulting values: test,test

The backslash ('\') character is gone after the round trip!

After some basic analysis, I figured out that this happens because CSVReader uses the backslash ('\') as its default escape character, whereas CSVWriter uses the double quote ('"') as its default escape character.

What is the reason behind this inconsistency in default behavior?

To fix the above problem, I found the following two solutions:

1) Overriding the default escape character of CSVReader with the null character:

    CSVParser csvParser = new CSVParserBuilder().withEscapeChar('\0').build();
    CSVReader csvReader = new CSVReaderBuilder(new StringReader(writer.toString())).withCSVParser(csvParser).build();

2) Using RFC4180Parser, which strictly follows the RFC 4180 standard:

    RFC4180Parser rfc4180Parser = new RFC4180ParserBuilder().build();
    CSVReader csvReader = new CSVReaderBuilder(new StringReader(writer.toString())).withCSVParser(rfc4180Parser).build();

Can either of the above approaches cause side effects on other characters?

Also, why is RFC4180Parser not the default parser? Is it only to maintain backward compatibility, since RFC4180Parser was introduced in later versions?

vatsal mevada
  • The CSV 'market' has poor acceptance of the standard; I guess many programmers don't know the standard exists. Many home-brew solutions exist with subtle hidden errors. A lot of business software is built on not-totally-correct formats – Jacek Cz Sep 10 '17 at 10:39
  • @JacekCz I agree with the fact you mentioned. However I am more curious about the inconsistency in API's default behavior and correctness of the solutions I mentioned in the post. – vatsal mevada Sep 10 '17 at 10:58
  • Hello Vatsal - please check my response in the sourceforge support request you opened. https://sourceforge.net/p/opencsv/support-requests/50/ – Scott Conway Sep 11 '17 at 14:06
  • @ScottConway very well explained on that support request. I wish SOF allows comments to be marked as accepted solution. – vatsal mevada Sep 16 '17 at 18:43

1 Answer

I think we are looking at 2 types of escaping here.

1) Escaping the double quote in csv:

test,"Monitor 24"", Samsung"
test,"Monitor 24\", Samsung"  // Linux style

Since we have a comma in the second field, that field has to be surrounded with double quotes. Any double quotes inside that field then have to be escaped, with "" or \".
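To make the first ("" doubling) style concrete, here is a minimal sketch in plain Java. The class and method names (Rfc4180Quote, quoteField, toLine) are made up for illustration and are not part of opencsv; unlike this sketch, opencsv's CSVWriter quotes every field by default.

```java
// Hypothetical helper, not part of opencsv: RFC 4180-style field quoting.
final class Rfc4180Quote {

    // Wraps a field in double quotes only when it contains a comma, a double
    // quote, or a line break, and doubles every embedded double quote.
    // Backslashes are left untouched because they have no special meaning here.
    static String quoteField(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted fields into a single comma-separated line.
    static String toLine(String... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quoteField(fields[i]));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // prints: test,"Monitor 24"", Samsung"
        System.out.println(toLine("test", "Monitor 24\", Samsung"));
        // prints: t\est,t\est
        System.out.println(toLine("t\\est", "t\\est"));
    }
}
```

With this style the backslash is ordinary data, so a reader that only understands "" doubling would hand back t\est unchanged.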

2) \ is also a general escape character, for example \t (tab) or \n (newline).

And since 'e' is not in the list of characters to escape, the \ is simply ignored and removed.

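So if you write "t\\\\est", the file would contain "t\\est" (an escaped backslash) and show "t\est" after reading. Writing "\\test", on the other hand, would most likely come back as "test" rather than a tab followed by "est", since the escape character just gets dropped, as with "t\\est" in the question.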

To keep the \ after reading, you would indeed have to tell the parser somehow to ignore those sequences, but the current behaviour doesn't look inconsistent to me - actually they are both treating the \ as escape character.
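As a rough illustration of what happens on the reading side, here is a simplified model in plain Java. It is only a sketch of the behaviour described above, not opencsv's actual code; EscapeDemo, unescape and NO_ESCAPE are hypothetical names, and the assumption is that only the escape character itself and the double quote count as escapable.

```java
// Simplified model of escape handling inside a quoted field; not opencsv's code.
final class EscapeDemo {

    // Passing this value means "no escape character at all".
    static final char NO_ESCAPE = '\0';

    // Processes the content of one quoted field (without the surrounding quotes).
    // An escape character followed by an escapable character (the escape itself
    // or a double quote) yields that character; an escape character followed by
    // anything else is silently dropped.
    static String unescape(String content, char escape) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (escape != NO_ESCAPE && c == escape) {
                if (i + 1 < content.length()) {
                    char next = content.charAt(i + 1);
                    if (next == escape || next == '"') {
                        out.append(next);
                        i++; // skip the character we just consumed
                    }
                }
                continue; // the escape character itself is never emitted
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("t\\est", '\\'));      // prints: test
        System.out.println(unescape("t\\\\est", '\\'));    // prints: t\est
        System.out.println(unescape("t\\est", NO_ESCAPE)); // prints: t\est
    }
}
```

The last call corresponds to the idea of switching the escape character off: with no escape character configured, the backslash is just another character and survives the read.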

Danny_ds