fgetpos() behaviour depends on newline character

Question

Consider these two files:

file1.txt (Windows newline)

abc\r\n
def\r\n

file2.txt (Unix newline)

abc\n
def\n

I've noticed that for file2.txt, the position obtained with fgetpos is not incremented correctly. I'm working on Windows.

Let me show you an example. The following code:

#include <cstdio>

void read(FILE *file)
{
    int c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fpos_t pos;
    fgetpos(file, &pos); // save the position
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);

    fsetpos(file, &pos); // restore the position - should point to previous
    c = fgetc(file);     // character, which is not the case for file2.txt
    printf("%c (%d)\n", (char)c, c);
    c = fgetc(file);
    printf("%c (%d)\n", (char)c, c);
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

produces this output:

file1:
a (97)
b (98)
b (98)
c (99)


file2:
a (97)
b (98)
  (-1)
  (-1)

file1.txt works as expected, while file2.txt behaves strangely. To find out what's going wrong, I tried the following code:

#include <cstdio>

void read(FILE *file)
{
    int c;
    fpos_t pos;
    while (1)
    {
        fgetpos(file, &pos);
        // non-portable: assumes fpos_t is an integer type, as it is in MSVC
        printf("pos: %d ", (int)pos);
        c = fgetc(file);
        if (c == EOF) break;
        printf("c: %c (%d)\n", (char)c, c);
    }
}

int main()
{
    FILE *file = fopen("file1.txt", "r");
    printf("file1:\n");
    read(file);
    fclose(file);

    file = fopen("file2.txt", "r");
    printf("\n\nfile2:\n");
    read(file);
    fclose(file);

    return 0;
}

I got this output:

file1:
pos: 0 c: a (97)
pos: 1 c: b (98)
pos: 2 c: c (99)
pos: 3 c:
 (10)
pos: 5 c: d (100)
pos: 6 c: e (101)
pos: 7 c: f (102)
pos: 8 c:
 (10)
pos: 10

file2:
pos: 0 c: a (97) // something is going wrong here...
pos: -1 c: b (98)
pos: 0 c: c (99)
pos: 1 c:
 (10)
pos: 3 c: d (100)
pos: 4 c: e (101)
pos: 5 c: f (102)
pos: 6 c:
 (10)
pos: 8

I know that fpos_t is not meant to be interpreted by the programmer, because its representation is implementation-defined. However, the example above illustrates the problem with fgetpos/fsetpos.

How is it possible that the newline sequence affects the internal position of the file, even before those characters are encountered?

Btw changing "rt" to "rb" fixes the problem but it's not a solution for me, because I have to read some numbers and strings from the file using fscanf as well. — miloszmaki, Mar 26 '13 at 23:11
Can you add another line of text to your two files? I suspect that it will "correct itself", in the sense that `\r` isn't counted as a character, but when you get to the second line, the position will jump one character. In other words, `\r\n` counts as one unit, which is "1 character big". Your second part doesn't make sense. You haven't even reached the newline, and yet you are saying it goes wrong - I suspect you meant to show something else? — Mats Petersson, Mar 26 '13 at 23:15
I'm just saying that for file2.txt, the position is not incremented after the first character is read, which leads to wrong navigation with fsetpos as shown in the last part of code. The internal position of the file is going wrong even before it encounters the newline character. — miloszmaki, Mar 26 '13 at 23:19
I'm not able to reproduce your behavior -- when I read `file2.txt`, the `pos` goes 0, 1, 2, 3, 4 as expected. In any case, `fpos_t` values should always be treated as opaque data blobs when dealing with text files -- due to the CRLF translation that takes place in text mode, the only 100% portable way to use `fsetpos`/`fgetpos` is to only call `fsetpos` with a value returned by a prior call to `fgetpos` to reset the file pointer to an earlier value. — Adam Rosenfield, Mar 26 '13 at 23:22
Sure, I'm only calling fsetpos with the value obtained from fgetpos. Although I noticed it's wrong for file2.txt and then I've checked the position and noticed it is not incremented for the first character. Weird. — miloszmaki, Mar 26 '13 at 23:23
You can `fscanf()` with mode `"rb"` (also note `"t"` in the mode is not defined by the C Standard). — pmg, Mar 26 '13 at 23:27
Ehm, but HOW the heck does the C runtime know that "in a few characters, I'll have a newline consisting of `\n` instead of `\r\n`, so I'll not increment here?". Sure, it could be a bug in the implementation, but it just seems very wrong. (By the way, why is this tagged C++, `FILE *` functions are surely plain C, not C++) — Mats Petersson, Mar 26 '13 at 23:30
In the output you've shown there's no issue - the characters are being read correctly -- only the `pos` value is different, which the standard says is fine. I'm not sure about the second part, as you haven't shown actual output there. Can you provide the actual output for the second part? (i.e. print the characters) — teppic, Mar 26 '13 at 23:31
I get the correct (same for both) results with gcc in Linux (though everything's binary there). Does it happen if you use `"r"` rather than `"rt"`? — teppic, Mar 26 '13 at 23:43
Yes, I'm getting the same wrong results using `"r"`. I've created `file1.txt` and `file2.txt` using Notepad++ (there's an option to use Unix/Windows format for newlines). — miloszmaki, Mar 26 '13 at 23:48
No. With two lines and the third empty it's even worse - the `pos` goes 0,-1,0,1,2,... for the Unix file. — miloszmaki, Mar 27 '13 at 00:18
@miloszmaki What exactly is wrong with opening the file in binary mode? You'll still be able to call `fscanf` and friends, and Unix files will work perfectly. — user4815162342, Mar 27 '13 at 07:43

score 3 · Accepted Answer · answered Mar 27 '13 at 00:20


I would say the problem is probably caused by the second file confusing the implementation, since it's being opened in text mode, but it doesn't follow the requirements.

In the standard,

A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character

Your second file stream contains no valid newline characters (since the implementation looks for \r\n and converts it to the newline character internally). As a result, the implementation may not understand the line length properly, and may get hopelessly confused when you try to move about in the stream.

Additionally,

Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment.

Bear in mind that the library will not just read each byte from the file as you call fgetc - it will read the entire file (for one so small) into the stream's buffer and operate on that.


teppic


Does it mean `fgetpos`/`fsetpos` are not guaranteed to work correctly for files with Unix newlines on Windows? – miloszmaki Mar 27 '13 at 00:29

@miloszmaki - my understanding is that in Windows, text mode requires Windows-style text files for things to work correctly. The standard says text files in other formats may have to be modified in order to use in a text stream. – teppic Mar 27 '13 at 00:31
So the best solution would be to replace lone `\n` characters with `\r\n` sequences in files with different formatting? – miloszmaki Mar 27 '13 at 00:33
@miloszmaki - yes, if you convert the Unix files to Windows text format, you'll have no problem at all. – teppic Mar 27 '13 at 00:34
Is there a way to do this at runtime while using `FILE`? – miloszmaki Mar 27 '13 at 00:35

score 2 · Answer 2 · edited May 23 '17 at 12:07

I'm adding this as supporting information for teppic's answer:

When dealing with a FILE* that has been opened as text instead of binary, the fgetpos() function in VC++ 11 (VS 2012) may (and does for your file2.txt example) end up in this stretch of code:

// ...

if (_osfile(fd) & FTEXT) {
        /* (1) If we're not at eof, simply copy _bufsiz
           onto rdcnt to get the # of untranslated
           chars read. (2) If we're at eof, we must
           look through the buffer expanding the '\n'
           chars one at a time. */

        // ...

        if (_lseeki64(fd, 0i64, SEEK_END) == filepos) {

            max = stream->_base + rdcnt;
            for (p = stream->_base; p < max; p++)
                if (*p == '\n')                     // <---
                    /* adjust for '\r' */           // <---
                    rdcnt++;                        // <---

// ...

It assumes that any \n character in the buffer was originally a \r\n sequence that had been normalized when the data was read into the buffer. So there are times when it tries to account for that (now missing) \r character that it believes previous processing of the file had removed from the buffer. This particular adjustment happens when you're near the end of the file; however there are other similar adjustments to account for the removed \r bytes in the fgetpos() handling.

I am not a lawyer, and this may fall under Fair Use, but the source code to the Microsoft CRT is not free to redistribute (at least in VS 2010; I'm assuming that VS 2012 is the same). The license in it is "Copyright (c) Microsoft Corporation. All rights reserved.". — Adam Rosenfield, Mar 28 '13 at 18:08
I assume that this small snippet would fall under fair use. If there's disagreement about this, I'm perfectly fine with deleting. — Michael Burr, Mar 28 '13 at 18:24
