
Referring to: Why should text files end with a newline?

I'm writing a text editor that I want to work on macOS and Linux and be POSIX-compliant. Should I expect a CR (old MacOS) or LF (new MacOS and Linux) or should I ever expect a CR+LF (Windows) when parsing the raw text files?

user129393192
  • When you write a file in text mode, `\n` is automatically converted to the appropriate newline for the current OS. So you shouldn't have to deal with this yourself. – Barmar Aug 03 '23 at 16:29
  • @Barmar But that wouldn't apply if the text editor needs to edit non-native text files. – Ian Abbott Aug 03 '23 at 16:31
  • The newline character is the newline character. If you open a file in text mode on Windows, the I/O library will convert CRLF line endings to `'\n'` (and that's a reason why there are restrictions in standard C on what you can do with text files, though the restrictions are moot on Unix-like systems). If you want to handle all three, that's noble and possible, but to handle the old MacOS style `'\r'` will require customized line reading code (standard C `fgets()` won't handle that; POSIX `getdelim()` will). Using `fgets()` or `getline()` will handle CRLF, but you'll have to remove `'\r'`. – Jonathan Leffler Aug 03 '23 at 16:32
  • In that case you should have a way for the user to specify the file format, and you write the appropriate newline. – Barmar Aug 03 '23 at 16:32
  • Ideally you should *expect* and be able to handle everything. But (depending on what you want to achieve) you only need to *write* one format of your choosing. – Konrad Rudolph Aug 03 '23 at 16:33
  • @Barmar Yes, but by default the editor could guess the line ending style based on the file contents. Vim does that, for example. – Ian Abbott Aug 03 '23 at 16:34
  • @IanAbbott Yes, and so does GNU Emacs. Look for the first CR or LF, and keep that. Explicit configuration is only needed when creating new files. – Barmar Aug 03 '23 at 16:35
  • @Barmar Yes, but Vim is better than GNU Emacs. – Ian Abbott Aug 03 '23 at 16:36
  • I don't have an old Mac to test what its C library did with newline mapping for text files. In theory, reading a text file should have mapped the `'\r'` to `'\n'` on input and done the reverse on output. It's mostly moot now. Indeed, you could probably ignore the `'\r'` format altogether; it is very unlikely you'll come across such files in practice. It's been over 20 years since Mac OS X 10.0 Cheetah was released, so any classic Mac OS files are probably at least that old. I'd use a preprocessor to convert such files to Unix (LF) line endings: `tr '\r' '\n' < old > new` would do that job. – Jonathan Leffler Aug 03 '23 at 16:37
  • I am not using the I/O library. I am using the POSIX library. `read`, `write`. Does the appropriate text transformation for the given OS still occur? – user129393192 Aug 03 '23 at 17:07
  • Open the file in `"b"` binary mode, and ascertain the line type from the line endings. Then when a user adds more lines, you can use the appropriate line ending for the file type when the user saves the file (unless they want to change to another type). – Weather Vane Aug 03 '23 at 17:15
  • I just stated I am not using the `stdio` library. @WeatherVane – user129393192 Aug 03 '23 at 17:17
  • No matter. Open the file in binary mode. Then you are in complete control of the line endings. – Weather Vane Aug 03 '23 at 17:18
  • @WeatherVane https://man7.org/linux/man-pages/man2/open.2.html – user129393192 Aug 03 '23 at 17:19
  • It should not matter *how* you open the file. Open it in binary mode, and handle the line endings yourself. – Weather Vane Aug 03 '23 at 17:20
  • You keep repeating that, but in the end, I am doing something like `read(STDIN_FILENO, buf, 1) == 1 && buf[0] != '\n'`, am I not? I am asking about the `\n` character and whether that will be automatically translated. Binary mode is a `stdio` facility; POSIX `read` is meant to give me the raw bytes. – user129393192 Aug 03 '23 at 17:23
  • 'Binary mode' *means* raw bytes. Read each character and determine whether it is 10, 13, a combination, etc. As one answer says: you should cater for all eventualities, not rely on what `\n` may or may not be re the file content. – Weather Vane Aug 03 '23 at 17:33

2 Answers


If you want your editor to be versatile and handle files originating from various systems in a sensible way, you should accept all three possibilities for the line ending sequence: a single LF byte for Unix systems, including Linux and Mac OS X / macOS; a single CR byte for files created on classic Mac OS; and the sequence CR+LF for files produced on Microsoft Windows, MS-DOS, and the original CP/M system.

You could autodetect the end-of-line flavor by scanning the beginning of the file: if you find CR+LF sequences, it is a Windows file; if you find CR bytes not followed by LF, it is an old Mac file; if you find LF bytes, it is a Unix file. If none of these are present, it is either a binary file or a single-line text file without a line ending; for these, use the default for the current execution platform.
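A minimal sketch of such a detection in C, assuming the start of the file has already been loaded into memory. The names `eol_style` and `detect_eol` are made up for this illustration, and the function simplifies the idea by classifying on the first terminator it meets rather than counting all of them:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch only. */
enum eol_style { EOL_NONE, EOL_LF, EOL_CRLF, EOL_CR };

/* Classify a buffer by the first line terminator found in buf[0..len).
 * Caveat: if buf is only a prefix of the file and its last byte is CR,
 * the result is EOL_CR even though an LF might follow in the file. */
enum eol_style detect_eol(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return EOL_LF;              /* bare LF: Unix style */
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n')
                return EOL_CRLF;        /* CR followed by LF: Windows style */
            return EOL_CR;              /* bare CR: classic Mac OS style */
        }
    }
    return EOL_NONE;                    /* no terminator: caller picks a default */
}
```

When the result is `EOL_NONE`, the caller would fall back to the platform default, as described above. A more careful version could count each flavor over the scanned prefix and report mixed files.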

Whether you preserve the line ending flavor for modifications or you convert the line ending sequences to the local flavor is a design decision. In Quick Emacs, I chose to preserve the line ending flavor and let the user perform the conversion on demand with specific commands.

In any case, you should open the file in binary mode for both reading and writing and handle the line endings explicitly in your program.
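As a rough illustration of handling the line endings explicitly, the sketch below walks a buffer of raw bytes (for example, one filled by POSIX `read`, which performs no newline translation) and treats LF, CR, and CR+LF each as a single terminator. The function name `next_line` and its interface are assumptions made for this example, not part of any library:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the line that starts at
 * buf[pos], excluding its terminator, and stores in *next the offset
 * where the following line begins. LF, CR and CR+LF all end a line. */
size_t next_line(const char *buf, size_t len, size_t pos, size_t *next)
{
    assert(pos <= len);

    size_t i = pos;
    while (i < len && buf[i] != '\n' && buf[i] != '\r')
        i++;

    size_t line_len = i - pos;

    if (i < len) {
        /* Consume CR+LF as one terminator, otherwise a single byte. */
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
            i += 2;
        else
            i += 1;
    }
    *next = i;
    return line_len;
}
```

A caller would start with `pos = 0` and keep calling `next_line` with the returned `*next` until it reaches `len`. When saving, the editor writes whichever terminator it decided on, byte for byte.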

chqrlie

Should I expect a CR (old MacOS) or LF (new MacOS and Linux) or should I ever expect a CR+LF (Windows) when parsing the raw text files?

Do you have any reason to believe no user of your software will ever create a file containing any arbitrary sequence of bytes and then ask your software to operate on it?

Do you have any reason to believe no user of your software will ever copy a file from some foreign system or ancient storage medium onto their main system and then ask your software to operate on it?

What do you want your software to do when the user does that?

To answer directly, should you expect it? No, because you do not know what a user will do. Should you prepare for it? Yes, because you do not know what a user will do.

Eric Postpischil
  • So my question was, what should I count on the newline character `\n` being? I am asking in terms of encoding, if that wasn't clear. It seems like the `stdio` library has special facilities for translating `\n` to `CR + LF` on Windows, or to the appropriate sequence for any OS, but I am using the POSIX `read` and `write` functions, so my question is whether I should also explicitly check for `\r`, which (I assume) would in turn correspond to `CR`. I am wondering whether it is portable to assume this mapping of `\n` -> `LF` and `\r` -> `CR`. That is my question. – user129393192 Aug 03 '23 at 17:27
  • What is unclear? Yes means yes. The answer to the question about whether you should prepare for various things is yes. Count on nothing. Prepare for everything. – Eric Postpischil Aug 03 '23 at 17:31
  • No, that's not my question. My question is: does `\n` map to `LF` and `\r` to `CR` in terms of *encoding*. Either you think it is obvious, or you don't see that that was my question. Either way, it was not explicitly answered. The question was not if I should prepare for various things. – user129393192 Aug 03 '23 at 17:45
  • Map where? You say you are going to use POSIX `read` and `write`. They do not map text streams the way the C standard library routines do. If you use the C standard library text streams, they map between the C model with `\n` characters terminating lines and some representation of lines in the host environment, which is technically specific to the C implementation. What “raw text files” are you going to parse and with what routines? Are they raw text files created solely by text-file tools on the host system? – Eric Postpischil Aug 03 '23 at 17:54
  • Ok. I'll try to be more explicit, perhaps my choice of words was poor. If I do `read(STDIN_FILENO, buf, 1) == 1 && buf[0] == '\n'`, and this statement evaluates true, did I just receive a `LF`? Likewise, if I do `read(STDIN_FILENO, buf, 1) == 1 && buf[0] == '\r'` and this is true, is that always a `CR`? – user129393192 Aug 03 '23 at 18:00
  • @user129393192: `read` and `write` do not do any mappings. If `read` reads `'\n'`, there is a new-line character in the file, also called “LF”. If it reads `'\r'`, there is a carriage return character in the file. – Eric Postpischil Aug 03 '23 at 18:02
  • Understood. Thank you. It is just the confusion here where I thought newline doesn't actually mean `LF`, but is system dependent and may mean `CR+LF` (this was the point I was confused on -- what does newline mean?). If you update the answer to include this (which was my question -- though perhaps ill-formed), I will accept it. – user129393192 Aug 03 '23 at 18:04
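
To illustrate the point from the comments that `read` and `write` perform no mapping, here is a small sketch that pushes the bytes 0x0D and 0x0A through a pipe and compares what comes back against `'\r'` and `'\n'`. The function name is made up for this example, and the check assumes an ASCII-compatible execution character set, which is what Linux and macOS use in practice:

```c
#include <unistd.h>

/* Hypothetical demo: returns 1 if the bytes CR (0x0D) and LF (0x0A)
 * come back from read() unchanged and compare equal to '\r' and '\n',
 * 0 if they do not, and -1 if a system call fails. */
int read_is_untranslated(void)
{
    static const unsigned char sent[2] = { 0x0D, 0x0A }; /* CR, LF */
    unsigned char got[2];
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    if (write(fds[1], sent, sizeof sent) != (ssize_t)sizeof sent) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);

    /* Read one byte at a time, as in the snippets from the comments. */
    for (size_t i = 0; i < sizeof got; i++) {
        if (read(fds[0], &got[i], 1) != 1) {
            close(fds[0]);
            return -1;
        }
    }
    close(fds[0]);

    return got[0] == '\r' && got[1] == '\n';
}
```

Note that the comparison is against the byte itself (`got[0]`), not against the buffer pointer, and that the return value of `read` is compared with 1 because an error return of -1 would also count as true in a bare `&&` test.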