53

I figured the line-breaking problem must already have been solved by someone, even if the solution isn't widely adopted. Thinking ahead, I went looking for a platform-independent Unicode way to separate lines, and found the character U+2028. Then I found Jeff Atwood's post on this topic, where he mentions that he's "...not sure under what circumstances you would want those Unicode newline markers."

Well, me neither. I did a little digging in the C# source code, and it looks like LS (U+2028) is not treated as a line terminator by TextReader.ReadLine(), nor by Java's BufferedReader.readLine(). So my conclusion is that it is not widely supported.

I would love to have a bright future where I can write files using a single format in Linux, MacOS and Windows. Does this little character have promise? What is it currently used for?

Elijah

2 Answers

15

Nicked from McDowell’s comment on the same page, and indirectly from the Unicode docs:

Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Windows Notepad program and the Mac SimpleText program interpret their platform’s NLF as a paragraph separator, not a line separator.

NLF (New Line Function) in this context is shorthand for CR, LF and CRLF. By contrast, the two Unicode characters have unambiguous uses.

Rory O'Kane
MSalters
  • Thanks for the link to the Unicode docs! They go into more detail on `LS` (U+2028). It's some kind of option for `CR` or `LF`. Further: "A line separator indicates where a line break alone should occur, typically within a paragraph. ... For comparison, line separators basically correspond to HTML `<BR>`"
    – BurninLeo Jul 14 '16 at 09:02
  • 1
    It has another advantage - in a comma or tab delimited file, it can replace newlines in a column that is multiline, without complicating the processing of the file (for example with simple shell pipe tools). – Amir Abiri Feb 04 '18 at 09:34
  • 2
    @AmirAbiri Good thinking. Note, though, that on reading this file, you will usually have to then replace `LS` with a line break supported by the program, often represented by the `\n` escape sequence in strings. For example, in Python 2: `u'First line\u2028Second line'.replace(u'\u2028', u'\n')` – Daniel Werner Jan 25 '19 at 17:37
  • I agree, but using the more common alternative controls has problems caused by their ambiguity and by their use in many file formats and data structures. If we want to be sure we can embed a line separator or paragraph separator in those cases, LS and PS will do the trick (as they are not used by these file/data formats). Their standard support in browsers and renderers is required and works. Nothing then prevents replacing them later with ASCII controls (CR, LF, CR+LF) where these can work. The automatic inverse conversion, however, is not possible. So LS and PS remain, avoiding complications. – verdy_p Feb 09 '21 at 20:08
  • So if they are still not recognized by some APIs like TextReader.ReadLine(), report this as a bug in the library (or as a feature to implement in some way, possibly with a conditional processing flag so that they are recognized). It may happen, however, that one wants to include a PS or LS in the *middle* of the input line to process (which is why recognizing them as line terminators should be conditional). For compatibility, this API in Java, C#, libreadline or the console API was not modified (as that could break existing apps). – verdy_p Feb 09 '21 at 20:11
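
The delimited-file idea from the comments above can be sketched in Python 3 roughly as follows. This is only an illustration under stated assumptions: the helper names (`encode_cell`, `decode_cell`, `write_row`, `read_row`) are hypothetical, the file is assumed to be tab-delimited with no tabs inside cells, and every embedded CR, LF or CR+LF is assumed to collapse to LS on writing.

```python
LS = "\u2028"  # U+2028 LINE SEPARATOR


def encode_cell(text):
    """Replace embedded newlines (CR+LF, CR, LF) with LS.

    CR+LF is handled first so it becomes one LS rather than two.
    """
    return text.replace("\r\n", LS).replace("\r", LS).replace("\n", LS)


def decode_cell(text, newline="\n"):
    """Turn LS back into a newline the consuming program understands."""
    return text.replace(LS, newline)


def write_row(cells):
    """Join cells into one physical line (hypothetical tab-delimited format)."""
    return "\t".join(encode_cell(cell) for cell in cells)


def read_row(line):
    """Split one physical line back into cells, restoring the newlines."""
    return [decode_cell(cell) for cell in line.split("\t")]


row = ["id-1", "First line\nSecond line"]
line = write_row(row)      # no "\n" inside, so line-based tools see one record
restored = read_row(line)  # the multi-line cell is back
```

Two caveats. The original CR / LF / CR+LF distinction is lost on the way back, which matches the remark above that the inverse conversion is not automatic. And when reading such a file in Python, split records on `"\n"` explicitly: `str.splitlines()` treats U+2028 and U+2029 as line boundaries and would break the record apart.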
12

Per the Unicode Newline Guidelines, U+2029 paragraph separator (PS) unambiguously indicates an intent to separate paragraphs. U+2028 line separator (LS) does likewise for lines. The other newline function characters, LF, CR, CR+LF, and NEL, are ambiguous, with their meanings dependent on platform and application.

For example, a LF might separate paragraphs in a word processing application but only lines in a simple text editor. By contrast, PS always separates paragraphs, regardless of the type of application.
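
To make the distinction concrete, here is a minimal Python 3 sketch that reads structure from the two characters alone. The function name `parse` and the sample text are made up for illustration; the only assumption is that the text uses PS between paragraphs and LS between lines within a paragraph.

```python
LS = "\u2028"  # U+2028 LINE SEPARATOR: line break within a paragraph
PS = "\u2029"  # U+2029 PARAGRAPH SEPARATOR: boundary between paragraphs


def parse(text):
    """Return a list of paragraphs, each a list of lines.

    No platform or application guesswork is needed: PS always means
    "new paragraph" and LS always means "new line".
    """
    return [paragraph.split(LS) for paragraph in text.split(PS)]


text = "Roses are red," + LS + "violets are blue." + PS + "A second paragraph."
paragraphs = parse(text)
# paragraphs == [["Roses are red,", "violets are blue."], ["A second paragraph."]]
```

With LF in place of both characters, the same input could not be parsed this way without knowing which convention the producing application followed.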

Edward Brey