
I'm initially looking at using the CSVPrinter from the Apache Commons CSV library, which offers several record separator choices: \n, \r, or \r\n. Or I could just use System.lineSeparator(). However, that only honors the line separator convention of the producer platform. My concern is this: if I have no control over the consumer's platform or the language they choose, how do I minimize the risk of a consumer erroneously reading \r into a parsed record? For example, the consumer might be written in C++ and use getline() to read each line.

Is it safe to always specify just \n as the record separator on the producer side? Would any program on a Windows/DOS platform then consume the file and recognize the line breaks properly? If I just use Java's own BufferedWriter.newLine(), would the same problem still exist (in that it writes whatever line separator the producer system uses, with no control over how the consumer will perceive it)?

If using just \n is the safest thing to do, I'm not sure why the most prevalent CSVFormat settings in Apache Commons CSV (or so I thought?) still set the record separator to \r\n, in both the DEFAULT and EXCEL formats.

Superziyi

1 Answer


tl;dr

Use CRLF (Carriage Return, Line Feed) to terminate lines, per RFC 4180, the only well-written specification for CSV tabular data files.

Follow the spec: CRLF

All kinds of people have been writing all kinds of documents with all kinds of formats… all the while calling them “CSV”. After decades of trouble and confusion, some folks finally wrote down a specification for exactly what “CSV” means. That spec was published by The Internet Society (2005) through the Internet Engineering Task Force (IETF).

RFC 4180, Common Format and MIME Type for Comma-Separated Values (CSV) Files, is the specification for CSV format. The spec is augmented by RFC 7111.

RFC 4180 requires CRLF as the line delimiter. Section 2, item 1 of RFC 4180 clearly states:

Each record is located on a separate line, delimited by a line break (CRLF).

So terminate each line with a CARRIAGE RETURN followed by a LINE FEED. The Unicode code points are 13 and 10 in decimal (U+000D and U+000A).
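As a rough illustration of what terminating every record with CRLF looks like on the producer side, here is a minimal sketch in plain Java. The class and method names (`CrlfCsvSketch`, `escape`, `writeRecord`, `toCsv`) are hypothetical, made up for this example, and the quoting logic is a simplified reading of RFC 4180 rather than a replacement for a real CSV library.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical helper for illustration only; not part of any library.
class CrlfCsvSketch {

    // Fixed record terminator, independent of the producer's platform.
    static final String CRLF = "\r\n";

    // Quote a field if it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\r") || field.contains("\n");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Write one record: comma-separated fields followed by CRLF.
    static void writeRecord(Writer out, List<String> fields) throws IOException {
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(escape(fields.get(i)));
        }
        out.write(CRLF); // never System.lineSeparator()
    }

    // Convenience wrapper that renders all records into a single string.
    static String toCsv(List<List<String>> records) {
        StringWriter out = new StringWriter();
        try {
            for (List<String> record : records) {
                writeRecord(out, record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = toCsv(List.of(
                List.of("id", "name"),
                List.of("1", "Doe, Jane")));
        // Make the control characters visible when printing.
        System.out.println(csv.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The point of the sketch is only that the terminator is a constant chosen by the producer, so the output is the same regardless of the operating system the code runs on.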

Every platform can parse CRLF. Communicate to the consumers of your CSV files that you are using the RFC 4180 standard format including CRLF line delimiters.
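On the consumer side, a minimal sketch in Java of reading such a file might look like the following. The names `LineEndingTolerantReader`, `readLines`, and `stripTrailingCr` are hypothetical. `BufferedReader.readLine()` treats \n, \r, and \r\n alike as line terminators, while `stripTrailingCr` shows the extra step needed by a consumer that splits on \n only (the C++ `getline()` case raised in the question). Note the assumption that no quoted field contains an embedded line break; a real CSV parser is needed for that.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any library.
class LineEndingTolerantReader {

    // Split text into lines; readLine() accepts \n, \r, or \r\n as terminators
    // and returns each line without its terminator.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // For consumers that split on \n only: drop a single trailing \r, if present.
    static String stripTrailingCr(String line) {
        if (line.endsWith("\r")) {
            return line.substring(0, line.length() - 1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(readLines("a,b\r\nc,d\r\n")); // CRLF-terminated input
        System.out.println(readLines("a,b\nc,d\n"));     // LF-terminated input
    }
}
```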

By the way… A decade after RFC 4180, the W3C felt the need to write their own standard, to address the supposed deficiencies of the RFC 4180 specification. If you feel the need, study Model for Tabular Data and Metadata on the Web, and related documents. With breath-taking decisiveness, the W3C declared the line terminator to be… CRLF or LF. Yes, a specification consciously written to not be specific. I stopped reading there; I recommend you stick with RFC 4180. And even the W3C says the line endings “should be CRLF”.

Apache Commons CSV supports RFC 4180

You are using the Apache Commons CSV library. Note that the library provides a predefined CSVFormat constant supporting the RFC 4180 standard format: CSVFormat.RFC4180.

You asked:

I'm not sure why … apache commons csv is still setting recordseparator to \r\n

Because the standard says so.

Basil Bourque
  • I know that class; it's what I linked in the question. The fact that they provide so many different format options seems to mean there are several legitimate ways to write CSV files though, right? This answer says `In practice, in the modern context of writing to a text file, you should always use \n` https://stackoverflow.com/a/1761086/1208309 Is the idea that the consumer is responsible for parsing correctly, but we should not simply discard really old platforms that do not recognize `\n` as a line break? – Superziyi Oct 07 '22 at 01:39