8

I am reading an ASCII text file whose layout is defined by the size of each field, in bytes. For example, each row consists of 10 bytes for a string, 8 bytes for a floating-point value, 5 bytes for an integer, and so on.

My problem is reading the newline sequence, whose size varies with the OS (usually 2 bytes on Windows and 1 byte on Linux, I believe).

How can I get the size of the EOL character in C++?

For example, in Python I can do:

len(os.linesep)
fuz
jramm
  • If you're opening the file in text mode, newlines should always just be `'\n'`, whatever the native line ending is. Do you really need to know the size of the native EOL string? – Badministrator Jan 05 '16 at 07:43
  • Is the file guaranteed to have been saved under the same OS as the one your code that reads it runs on? If yes, simply open the file in text (not binary) mode. – dxiv Jan 05 '16 at 07:45

2 Answers

1

The time honored way to do this is to read a line.

Now, the last char should be \n. Strip it. Then, look at the previous character. It will either be \r or something else. If it's \r, strip it.

For Windows [ascii] text files, there aren't any other possibilities.

This works even if the file is mixed (e.g. some lines are \r\n and some are just \n).

You can tentatively do this on a few lines, just to be sure you're not dealing with something weird.

After that, you know what to expect for most of the file. Still, the strip method is the generally reliable way: on Windows, you could have a file imported from Unix (or vice versa).

Craig Estey
  • Half a nitpick, but it's hard to `read a line` without knowing beforehand what the line terminator is. For example, your recipe fails for `\r` line terminators, and also for consecutive empty lines saved as `\r\n\n\n` which have been sighted in windows-land. – dxiv Jan 05 '16 at 08:05
  • @dxiv The method works against `\r\n\n\n` (e.g. `\r\n \n \n`)--that's just mixed mode as I mentioned [consecutive is non-issue]. I haven't seen a `\r` only file in 20+ years [if ever, and I've converted 1000's of files]. Not readable by many programs as they now assume [at least] newline. Try DOS `type file` on one ;-) I don't think even MS supports them anymore. '\r' is valid [as non-terminator] at the _beginning_ of a line (e.g. captured progress output). I've seen much more of that (e.g. `\rpgm is 56% done\rpgm is 57% done`) – Craig Estey Jan 05 '16 at 08:45
  • @CraigEstey - Old school Mac files are \r only. See wikipedia: https://en.wikipedia.org/wiki/Newline – user3690202 Jan 11 '16 at 09:21
  • @user3690202 I guessed as much, but, this is beyond the scope of OP's question. Such a file would need to be converted upon import to the [NTFS] FS to be usable under WinX--so OP would never see them raw. They can be auto-detected/converted, but it's better to just "know" [via cmd line option]. The fastest way to do line reads is via `mmap` (See my answer: http://stackoverflow.com/questions/33616284/read-line-by-line-in-the-most-efficient-way-platform-specific/33620968#33620968), so easy enough to prescan first, but hardly worth the extra effort in 99.44% of cases. – Craig Estey Jan 11 '16 at 22:56
  • @CraigEstey - There are many ways I can think of to get CR terminated text files. You could boot a windows machine using a linux boot disk and copy files from an old drive, etc. Point is - nowhere does the OP mention windows, copying a file onto a windows machine doesnt "import to the FS", heck Vim can generate CR line ending text files on a windows machine if you really wanted. It doesn't seem "beyond the scope" of the question - indeed it seems the entire point of the question, a point that you have missed. – user3690202 Jan 12 '16 at 08:41
  • @user3690202 I've missed nothing my friend. vim [under windows] will generate `\r\n` [vim calls it "dos mode"] and I covered that mixed mode case in my post. You can turn dos mode on/off on either system. That is _different_ than `\r` only--which is malformed on WinX/unix and must be converted before any common/sane program can use them. OP _does_ mention windows--reread question. Time to move on ... – Craig Estey Jan 12 '16 at 09:07
  • @CraigEstey I think you need to learn how to use Vim, and learn how line endings work at the same time. http://vim.wikia.com/wiki/File_format set file format to mac and everything works fine. Utter nonsense what you say about it being "malformed"/ Nevermind, people like you don't have the ability to learn. Maybe move on to a textbook - 20 years experience, hah, must have missed MacOS 9 then, eh? – user3690202 Jan 12 '16 at 18:45
0

I'm not sure that the translation occurs where you think it is. Look at the following code:

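// Needs <sstream> and <string>.
std::ostringstream buf;
buf << std::endl;            // writes '\n' into the in-memory buffer and flushes
std::string s = buf.str();
std::size_t i = s.size();    // characters stored for one end-of-line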

After this, running on Windows, i == 1. So the end of line definition in std is 1 character. As others have commented, this is the "\n" character.

user3690202
  • This code is wrong because CRT lib doesn't turn `\n` into `\r\n` for in-memory buffers, but it does so for files and console. – Serge Rogatch Jan 05 '16 at 07:49
  • Here you are demonstrating the problem I am up against. C++ will convert "\n" into the os-specific character when writing to a file/console, but not to a buffer. – jramm Jan 05 '16 at 08:03
  • @jramm I don't think you explained your problem well enough yet. `\n` doesn't need to (and in fact couldn't) be encoded whatsoever when written to a buffer. But _when_ you write that buffer to a file opened in *text* mode, the `\n` will be translated automatically to whatever the platform mandates. Then if you open the same file in _text_ mode and read it back, the newline sequence will be translated back to `\n`. So, to me at least, it's not clear why you need to know the encoding of `\n` in the file on disk. – dxiv Jan 05 '16 at 08:14
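The distinction drawn in these comments (no translation in memory, translation only when a text-mode stream touches a file) can be observed directly. The sketch below is only an illustration: `newline_bytes_on_disk` is a hypothetical helper, and the scratch file path is whatever the caller supplies. It writes one `'\n'` through a text-mode stream and reads the raw bytes back in binary mode.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: write a single '\n' through a text-mode stream, then
// read the raw bytes back in binary mode to see what actually reached the disk.
std::string newline_bytes_on_disk(const char* path) {
    {
        std::ofstream out(path);      // text mode: '\n' may be translated
        out << '\n';
    }                                 // stream closed here, data flushed
    std::string raw;
    {
        std::ifstream in(path, std::ios::binary);   // binary mode: no translation
        raw.assign(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>());
    }
    std::remove(path);                // delete the scratch file
    return raw;
}
```

On a platform whose text mode maps `'\n'` to `"\r\n"` the returned string should have size 2, and on one that leaves it untouched, size 1. That makes `newline_bytes_on_disk(path).size()` a rough stand-in for Python's `len(os.linesep)`, though it describes the platform running the code, not necessarily the file being parsed.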