3

Why can't Windows Notepad read certain "new lines" that Notepad++ can read?

Well, that's not the real issue. My problem is with "std::ifstream::getline", which reads everything until it encounters one of "those new lines that only Windows Notepad recognizes". For example, Windows Notepad would display the file as follows:

12345
67890

Notepad++ would display it as follows:

1
2
3
4
...

and "std::ifstream::getline" would return "12345"?!

I need to parse CSV files with std::fstream, and a new CSV row is marked by the kind of new line that Notepad++ recognizes. So, is there an existing function, or a way to write a generic function, that can read those new lines?

Tito Tito
  • You might be experiencing UNIX-style line endings versus Windows-style line endings. Notepad++ should be able to read UNIX-style endings, while Notepad only displays Windows-style line endings. Here's some more info: http://stackoverflow.com/questions/419291/historical-reason-behind-different-line-ending-at-different-platforms – austin Sep 20 '13 at 00:17
  • @austin: I would guess that Tito Tito opened the file in text mode in which case the end of line sequence would get conflated into `'\n'` when opened on a Windows system, i.e., it would be indistinguishable from the embedded newlines. Admittedly, there is too little context to tell for sure, though. – Dietmar Kühl Sep 20 '13 at 00:35

3 Answers

12

There are 3 common line ending styles composed of the \n ("line-feed", or "newline") and \r ("carriage return") characters:

  • \r\n : Windows style
  • \n : UNIX style (Including Mac OSX)
  • \r : Mac style (pre-OSX)
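To check which of these styles a given file actually uses, one option is to read its raw bytes in binary mode and tally each kind of terminator. The sketch below is only an illustration: the names `LineEndingCounts` and `count_line_endings` are made up for this example, and it assumes the whole file has already been loaded into a `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical tally of the three terminator styles found in a buffer.
struct LineEndingCounts {
    std::size_t crlf = 0;  // "\r\n" (Windows style)
    std::size_t lf = 0;    // lone "\n" (UNIX style)
    std::size_t cr = 0;    // lone "\r" (pre-OSX Mac style)
};

// Scans raw bytes and counts each terminator style.
LineEndingCounts count_line_endings(const std::string& data) {
    LineEndingCounts counts;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] == '\r') {
            if (i + 1 < data.size() && data[i + 1] == '\n') {
                ++counts.crlf;
                ++i;  // skip the '\n' that belongs to this "\r\n" pair
            } else {
                ++counts.cr;
            }
        } else if (data[i] == '\n') {
            ++counts.lf;
        }
    }
    return counts;
}
```

A file with more than one non-zero counter has mixed line endings.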

Almost every program that deals with text will accept any one of these as a newline. I say almost because native Windows controls do not. Notepad is simply a Win32 Text Area control wrapped in a window frame. This means that you have to manually use Windows-style line endings when using text with Win32. This applies not just to Notepad: if you have a multi-line string in a Win32 popup, for example, you have to make sure you use \r\n or else you'll get everything on one line.
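If the text comes from elsewhere, it can be converted before it is handed to a Win32 control. The following is a minimal sketch, not a Win32 API: `to_crlf` is a hypothetical helper name, and it assumes the text uses a single-byte or UTF-8 encoding in which `\r` and `\n` never appear inside another character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites every line ending ("\n", "\r\n" or lone "\r") as "\r\n".
std::string to_crlf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r') {
            out += "\r\n";
            // If this '\r' starts an existing "\r\n" pair, consume its '\n' too.
            if (i + 1 < text.size() && text[i + 1] == '\n') ++i;
        } else if (c == '\n') {
            out += "\r\n";
        } else {
            out += c;
        }
    }
    return out;
}
```

Text that already uses `\r\n` passes through unchanged, so the function is safe to apply more than once.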

Most good text editors will have a setting somewhere for which line ending to use when saving. There are also command-line utilities like dos2unix or unix2dos that convert a text file from one to another.
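When converting the file up front is not an option, the reading code itself can accept all three styles. The sketch below is one possible approach rather than a standard library facility: `getline_any` and `split_lines` are hypothetical names, and the stream is assumed to deliver the raw bytes (for a file, that means opening it in binary mode).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one line, accepting "\n", "\r\n" or a lone "\r" as the terminator.
std::istream& getline_any(std::istream& in, std::string& line) {
    line.clear();
    std::istream::sentry guard(in, true);  // true: do not skip leading whitespace
    if (!guard) return in;                 // stream already failed or at EOF
    std::streambuf* sb = in.rdbuf();
    for (;;) {
        const int c = sb->sbumpc();
        if (c == '\n') return in;
        if (c == '\r') {
            if (sb->sgetc() == '\n') sb->sbumpc();  // swallow the '\n' of "\r\n"
            return in;
        }
        if (c == std::streambuf::traits_type::eof()) {
            // A final line without a terminator is still returned once;
            // an empty read at EOF sets failbit so that read loops stop.
            in.setstate(line.empty() ? (std::ios_base::eofbit | std::ios_base::failbit)
                                     : std::ios_base::eofbit);
            return in;
        }
        line += static_cast<char>(c);
    }
}

// Convenience wrapper for in-memory text.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    for (std::string line; getline_any(in, line); ) lines.push_back(line);
    return lines;
}
```

Because each terminator is decided byte by byte, a file that mixes styles is split the same way as a file that uses only one.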


Historical note:

ASCII and text terminals came about when the terminal was simply an electronic typewriter. The Carriage Return (CR) character \r meant put the printer carriage back to the beginning of the same line. Line Feed (LF) character \n meant move the paper up one line. The Windows philosophy is that to start a new line you must do both: CR LF.

Keith Thompson
Adam
  • I suspect many Unix programs would not react well to the old-Mac `\r` line endings. They 'handle' Windows `\r\n` endings because they treat them as a newline that happens to be preceded by a `\r`, which can cause problems if you have, for example, regexes that are trying to match end of line (for example, looking for a letter at the end of the line — but seeing the `\r` before the end of line as not matching). – Jonathan Leffler Sep 20 '13 at 00:59
  • @JonathanLeffler I don't think Unix programs handle the Windows line endings by accident. Text files get passed between OSes quite frequently and this would surface as a bug very quickly. The standard libraries for popular languages already do the right thing for you, so this only pops up if you parse manually. – Adam Sep 20 '13 at 01:30
  • It's also not uncommon to have multiple line ending styles in the same text file. I've had to make my parsers handle this as well. For example, a file may start out on OSX and have a few lines edited on Windows. Now you have `\n` and `\r\n`. – Adam Sep 20 '13 at 01:32
  • I beg to differ on the assertion that standard libraries do the right thing. The Unix `gets()`, `fgets()` functions (and the `puts()` function on output, and POSIX `getline()`) strictly deal with `\n`; they don't care whether there's a `\r` before it and certainly never add one on output (or remove one on input). As for `\r` only, the input functions will ignore `\r` and continue looking for `\n` (until they run out of space in the case of `fgets()`, or until EOF for `gets()`). Even `getdelim()` only deals in a single delimiter that must be specified in advance. – Jonathan Leffler Sep 20 '13 at 04:25
  • I did more snooping and it looks like you're right. I guess the files that worked for me before were pure luck. At least the text editors are smart about it, and the windows line endings USUALLY are ok, likely because in most cases whitespace is trimmed anyway. – Adam Sep 20 '13 at 09:35
6

First off, there is only one kind of newline: '\n'. However, on some systems the end of a line is marked by a sequence consisting of a newline and a carriage return ("\n\r") or a carriage return and a newline ("\r\n"). (These made some sense for printers with a head that wrote characters: sending a newline moved to the next line while otherwise staying at the same position, and sending a carriage return moved the head to the start of the line.) From the looks of it, you have a file that uses newlines and carriage returns for different purposes, but reading the file in text mode conflates the end-of-line sequence. Part of the mystery can probably be resolved by opening the file in binary mode, i.e., by adding the flag std::ios_base::binary when opening the file.

That wouldn't change the behavior of std::getline(), however: this function reads up to the first line-termination character, which is newline ('\n') by default. To read lines up to a different character you'd pass it as an additional parameter (I'm using the non-member function because it deals with arbitrarily long strings, whereas the member function reads into a char array; the member function could be used similarly):

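#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

std::ifstream in("file.csv", std::ios_base::binary);
// Outer loop: one record per '\n'-terminated line.
for (std::string line; std::getline(in, line); ) {
    std::istringstream sin(line);
    // Inner loop: split the record on '\r'.
    for (std::string field; std::getline(sin, field, '\r'); ) {
        std::cout << "field='" << field << "'\n";
    }
}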

Based on your description, it seems your file uses '\r' as a field separator. It may be something different, which is probably easiest to find out by opening the file in binary mode and then printing the individual characters together with their respective codes:

#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>

std::ifstream in("file.csv", std::ios_base::binary);
for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
    std::cout << std::setw(3)
              << int(static_cast<unsigned char>(*it)) << ' ' << *it << '\n';
}

This will just print each character's code and the character itself. You should be able to find the value of the field separators but I'd guess '\r' is being used.
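Echoing a raw '\r' or '\n' to a terminal moves the cursor instead of showing anything, so a variant that prints escape sequences can be easier to read. This is only a sketch of that idea: `describe_bytes` is a hypothetical name, and it assumes the data has already been read into a `std::string` in binary mode.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: renders each byte as "<code> <visible form>", one byte per line.
std::string describe_bytes(const std::string& data) {
    std::ostringstream out;
    for (unsigned char ch : data) {
        out << static_cast<int>(ch) << ' ';
        switch (ch) {
            case '\r': out << "\\r"; break;
            case '\n': out << "\\n"; break;
            case '\t': out << "\\t"; break;
            default:
                // Show printable characters as-is and anything else as '?'.
                if (std::isprint(ch)) out << static_cast<char>(ch);
                else out << '?';
        }
        out << '\n';
    }
    return out.str();
}
```

With this, a separator byte appears in the listing as `13 \r` or `10 \n` rather than as an apparently blank entry.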

Dietmar Kühl
0

Different platforms have different conventions for how line endings are indicated in text files. When you write the character \n to a text-mode stream in your program, you are asking the standard library to write whatever character(s) make up a line ending on your system; when reading, it translates them back to \n. If you have a text file that was written with standard tools on one system and you move it to another system, you must change the line endings to match the new system. FTP in text mode will do this. If you just copy bytes, you run the risk of having a text file that doesn't respect local conventions and won't be readable. (Try running a Windows-generated makefile through GNU make on a Unix system...) Some standard libraries are better at sorting out unconventional files than others, but if you need to move text files from one system to another, you need to respect local conventions and do the proper conversion outside of your program.
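The translation described above can be observed by writing through a text-mode stream and reading the raw bytes back in binary mode. The following is a small experiment rather than production code: `raw_bytes_of_text_newline` is a hypothetical name, and `path` is assumed to be a scratch file that the caller is happy to have overwritten and deleted.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Writes "a\n" through a text-mode stream, then returns the bytes that ended up in the file.
std::string raw_bytes_of_text_newline(const std::string& path) {
    {
        std::ofstream out(path);  // text mode: '\n' may be translated on output
        out << "a\n";
    }                             // leaving the scope closes and flushes the file
    std::string bytes;
    {
        std::ifstream in(path, std::ios_base::binary);  // binary mode: no translation
        bytes.assign(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
    }
    std::remove(path.c_str());    // clean up the scratch file
    return bytes;
}
```

On Windows the result is expected to be the three bytes `a`, `\r`, `\n`, while on a Unix system it is the two bytes `a`, `\n`.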

Pete Becker