Why does a 'new line' offset the byte position of all the characters in a .txt file by +1?

Question

When I use fstream::tellg, after reading in the first character with fstream::get (char) the result is: 1

I then insert a 'new line' after the first character

I fstream::seekg to the beginning: 0

When I use fstream::tellg, after reading in the first character this time the result is: 2

If I insert: "abc", into a .txt file:

after reading "a" tellg will give: 1
after "b" 2
and after "c" 3.

But if I insert "abc\n" (or "abc" << endl):

after reading "a" tellg will give 2
after "b" 3
after "c" 4
lastly 5 after the new line.

What is the reason for this?

I understand that a 'newline' is made of characters too. What I do not understand is why the tellg result is offset after reading a character. With each use of 'newline', this offset is incremented by one.

Update

Conclusion: There was a problem with my IDE setup! I have been using Code::Blocks. I tried building the program in Microsoft Visual Studio IDE and it ran with no trace of the problem. This does not mean that Code::Blocks is broken. It might have been an issue in my Code::Blocks settings. I have no recollection of changing anything. Even if that was the case; I, in my humble opinion, do not think it is right that you can change this sort of thing by accident. I am disappointed in Code::Blocks.
My solution: change IDE

Why do you need to "work around" it? The newline character is an actual character and it takes up one byte like any other character. Its value is 0x0A in ASCII. You might also see carriage returns, which are also a byte, before the newline, depending on who wrote the file and what OS you are reading it from. Your code should expect this. What are you expecting and how does it interfere with what you are trying to do? – Christopher Pisz, Oct 10 '18 at 21:17
This may be a question about _writing text_ vs _writing binary_ data, but it's impossible to know, since no problem has been described here. – Drew Dormann, Oct 10 '18 at 21:23
Why don't you look at the bytes that were written to the file? Open the file in a hex viewer and have a look. – rustyx, Oct 10 '18 at 21:36
@rustyx I have used a hex viewer; it was very interesting, but I did not find anything strange: every character was in its position. However, I did discover one thing. When I fstream::get at the 'newline' (\r\n) position, it increments the result of fstream::tellg by only one. This might be why all other characters are offset when using fstream::tellg. But I hope this hypothesis is wrong. – kevin kangaji, Oct 11 '18 at 12:32

Jerry Coffin · Answer 1 · 2018-10-12T16:07:21.623

My guess is you're writing code on a Microsoft OS.

In text files, Microsoft OSes (and associated software) expect the end of a line to be marked with a \r\n sequence, so when you write a new-line to a (text) file, it gets translated from \n to \r\n. So, even though you only inserted one character into the stream, that resulted in two characters being written to the external file.

If you're concerned with ensuring that the content of the external file exactly match what you inserted into the stream, that may indicate that you want what the C++ standard library would consider a binary file, which you'd get by specifying std::ios::binary when you open the file.
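To make the text-versus-binary difference concrete, here is a minimal sketch. The helper name `bytes_on_disk` and the file names in the usage note are made up for illustration. It writes the same string in either mode, then reopens the file in binary mode to count the bytes actually stored.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: write `text` to `path` in text or binary mode,
// then report how many bytes actually ended up in the file.
std::streamoff bytes_on_disk(const std::string& path,
                             const std::string& text,
                             bool binary) {
    std::ios::openmode mode = std::ios::out | std::ios::trunc;
    if (binary)
        mode |= std::ios::binary;   // suppress newline translation
    {
        std::ofstream out(path, mode);
        assert(out.is_open());
        out << text;
    }   // the stream is flushed and closed here

    // Reopen in binary mode, positioned at the end, so that tellg
    // reports the raw size of the file in bytes.
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    assert(in.is_open());
    return static_cast<std::streamoff>(in.tellg());
}
```

Calling `bytes_on_disk("demo.txt", "abc\n", true)` should report 4 on any platform. With `false` (text mode), I would expect 5 on Windows, because the `\n` is stored as `\r\n`, and 4 on Linux or macOS, where no translation takes place.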

Now, it is true that when you deal with a text file, tellg doesn't produce a very meaningful number. What we have is something like this:

The upper side is the data as you see it. The lower side is the data as it's stored in the file. When you call tellg, it's telling you the position along the lower side, that is, the position relative to the start of the file. But, depending on how many \r\n pairs there are before that in the file, that may result in a different number of characters in the upper row, which is what you'll see when you read the data from the file.

What this means is that the result from tellg can only be used in a few fairly specific ways--mostly, when you get a number from tellg, you can give that number back to seekg, and start reading from the same place.
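As a rough sketch of that "bookmark" usage (the helper name `reread_matches` and the file contents are invented for this example), the code below saves a position with tellg, reads on, then hands the saved value back to seekg and reads the same line again. It never does arithmetic on the value.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: use tellg only as an opaque bookmark that is
// later handed back to seekg.
bool reread_matches(const std::string& path) {
    {
        std::ofstream out(path);            // text mode
        assert(out.is_open());
        out << "first\nsecond\nthird\n";
    }

    std::ifstream in(path);                 // text mode again
    assert(in.is_open());

    std::string line, again;
    std::getline(in, line);                 // consumes "first" and its newline
    std::ifstream::pos_type mark = in.tellg();  // remember where line 2 starts

    std::getline(in, line);                 // reads "second"
    in.seekg(mark);                         // jump back to the bookmark
    std::getline(in, again);                // reads "second" once more

    return line == "second" && again == "second";
}
```

This should return true whether or not the platform translates newlines, because the value from tellg is only ever passed back to seekg on the same file.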

As far as your code goes, I don't see the behavior that I understand your question to be describing. I rewrote the code a bit to show the results together:

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <string>

using namespace std;

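// Render a character for display: printable characters as-is,
// common control characters as visible escape tags.
std::string show(char x) {
    if (x >= 32)    // 32 is the space character, the first printable one
        return std::string(1, x);
    else switch (x) {
    case '\r': return "<\\r>";
    case '\n': return "<\\n>";
    case '\t': return "<\\t>";
    default: return "<BAD>";
    }
}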

void display_txt_file(fstream& file)
{
    file.seekg(0, ios_base::beg);
    char x;
    cout << "tellg: " << file.tellg() << "| ";
    while (file.get(x))
    {
        cout << "'" << show(x) << "' tellg: " << file.tellg() << "| ";
    }
    file.clear();
    file.seekg(0, ios_base::end);
    std::cout << "\n";
//    cout << "\n> " << file.tellg() << "\n" << endl;
}

int main(int argc, char* argv[])
{
    ofstream new_file;
    new_file.open("test.txt");
    new_file.close();

    fstream file("test.txt", ios::in | ios::out);
    if (!file.is_open())
    {
        cout << "error file not opened" << endl;
        return 0;
    }

    file << "ABCD";
    display_txt_file(file);

    file.seekp(0);

    file << "ABCD\nE";
    display_txt_file(file);

    return 0;
}

When I run this on Windows, I get the following output:

tellg: 0| 'A' tellg: 1| 'B' tellg: 2| 'C' tellg: 3| 'D' tellg: 4|
tellg: 0| 'A' tellg: 1| 'B' tellg: 2| 'C' tellg: 3| 'D' tellg: 4| '<\n>' tellg: 6| 'E' tellg: 7|

So, everything up to the new-line matches, exactly as we'd expect. Then the new-line gets expanded to two characters, followed by the E. But, after we read the 'A', tellg has returned 1, not 2, as was claimed in the question.

edited Oct 12 '18 at 16:07

answered Oct 10 '18 at 21:16

Jerry Coffin

Opening a file in binary mode does not make bytes disappear. It is also quite a different thing than reading text. Your answer makes it sound like ios::binary magically undoes carriage return line feeds on Microsoft operating systems, which is neither the case nor its use. Better that the OP actually understand what carriage return and newline characters are in text mode, and that they take up actual bytes, as well as differences in character encodings and how fstream handles them by default. – Christopher Pisz Oct 10 '18 at 21:22
If somebody cares about how a new-line shows up in the file, rather than about being able to write a text file with lines the way text files are expected to be formatted on that platform, then pretty much by definition he's dealing with a binary file, and opening in binary format is the (only) correct way to deal with it. – Jerry Coffin Oct 10 '18 at 21:26
Your comment is ambiguous. Who is "somebody" and what do they "care" about, and who decided what is "expected", and what is this "definition?" It is quite irresponsible to go and tell the OP to use binary mode with no explanation of binary mode vs text mode. I don't even see anything in the OP's post about writing at all. I think you must be making a number of bad assumptions about what the OP is trying to do, and you are surely skipping over several paragraphs of explanation of text vs binary modes, what character encoding is, and how fstream encodes things by default. – Christopher Pisz Oct 10 '18 at 21:30
If I was new to C++, I'd read your answer to mean, "The ios::binary flag deletes newline characters in my fstream", which is absolutely incorrect. – Christopher Pisz Oct 10 '18 at 21:34
@ChristopherPisz: How could: "when you write a new-line to a (text) file, it gets translated from \n to \r\n. To stop that from happening, you normally want to specify std::ios::binary mode when you open the file." possibly be read as meaning anything will be deleted from a file? If you don't see anything in the OPs question about writing at all, you must not have read it at all. How would the "insert" in: "I then insert a 'new line' ..." mean anything other than writing? – Jerry Coffin Oct 10 '18 at 21:39
I understand that 'newline' also uses memory, but inserting other characters does not offset every other character by +1. Why do I see this when inserting 'newline'? – kevin kangaji Oct 11 '18 at 08:24
@JerryCoffin Thank you for taking your time. I know what binary mode is. What I do not understand is this offsetting of all the fstream::tellg results after \r\n has been used. I have discovered that when I fstream::get at \r\n, which would be fstream::get at the last position of the row, fstream::tellg only reports that as being one byte position. Another fstream::get then gets the first viewable character of the row. – kevin kangaji Oct 11 '18 at 12:47
@kevinkangaji: Yes, when you read a \r\n in the file, it gets converted and read as only a \n in what you read. fstream::tellg just tells you how many bytes you are from the beginning of the file, which may not be the same as the number of times you read from the file to get to that point. – Jerry Coffin Oct 11 '18 at 14:00
@JerryCoffin: Looking at the illustration you provided, "e" is definitely offset. It is the 6th inserted yet 7th by bit. "a" to "b" is unchanged in the illustration because they appear before this \n conversion. But I see this for every character from the first character to the last character, no matter where the \n is inserted. – kevin kangaji Oct 11 '18 at 15:32
@kevinkangaji: In that case, I think we're going to need to see code showing what you're seeing--it doesn't fit with anything I've ever seen nor anything I'd expect to see. – Jerry Coffin Oct 11 '18 at 15:35
@kevinkangaji: You're using a combination of `seekg` (which moves the "get" pointer--i.e., the reading position) and `tellp` (which tells you the "put" pointer--the writing position). You're looking at two numbers that aren't related at all. – Jerry Coffin Oct 12 '18 at 06:20
@JerryCoffin: In my experience, moving "put" also moves "get". But I digress; it was not appropriate of me to show code that does not represent my question, and I apologize. I have edited the code so that it is adequate. The same result can, however, be observed. – kevin kangaji Oct 12 '18 at 13:40
@kevinkangaji: I've edited the answer to include a version of your code, mildly modified to show that the claimed effect doesn't happen. – Jerry Coffin Oct 12 '18 at 16:12
@JerryCoffin: I ran the code on 2 different machines and got the same result that I did before. After "A", tellg returns 2 in the second pass. This seems to have to do with setup. This is not good. – kevin kangaji Oct 12 '18 at 17:02
@JerryCoffin: I solved the problem by installing and using Microsoft Visual Studio to run the code instead of Code::Blocks, which I had been using up until this point. – kevin kangaji Oct 12 '18 at 18:05
In text mode `tellg` doesn't necessarily represent a file offset (its only requirement is that it can be used to restore that position with `seekg`) – M.M Oct 13 '18 at 22:16

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

Update

Conclusion: There was a problem with my IDE setup! I have been using Code::Blocks. I tried building the program in Microsoft Visual Studio IDE and it ran with no trace of the problem. This does not mean that Code::Blocks is broken. It might have been an issue in my Code::Blocks settings. I have no recollection of changing anything. Even if that was the case; I, in my humble opinion, do not think it is right that you can change this sort of thing by accident. I am disappointed in Code::Blocks.
My solution: change IDE

answered Oct 13 '18 at 21:45

kevin kangaji

This probably means you have a bug in your program that happens to work as you expect for now in MSVC – M.M Oct 13 '18 at 22:17
Please post a [mcve], a small application demonstrating the difference. – rustyx Oct 14 '18 at 09:22

Christopher Pisz · Answer 3 · 2018-10-11T00:02:25.543

It is hard to tell what or why you'd be working around anything without an explanation of your expectations and a full code listing.

However, it is important that you understand character encoding when reading and writing to a file.

The newline character takes up a byte. Its value is 0x0A if we are using the ASCII character set. There are other character encodings aside from ASCII. There are also the UTF-8 and UTF-16 encodings, for example. Every character encoding might have a different byte, or multibyte, representation for a readable text character, as well as for the unreadable text characters, such as the newline.

On Windows, there is a convention to use carriage return followed by line feed, instead of just a line feed. Those two bytes would look like 0x0D, 0x0A in ASCII. On *nix systems there is no such convention.

Therefore, when you are counting bytes in your fstream, you will need to account for the newline character taking up a byte, or two bytes if you are expecting '\r\n'. That is, if you are using ASCII encoding.
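If you want to see those bytes without a separate hex viewer, something along these lines could work. It is only a sketch, and the helper names `read_raw` and `hex_bytes` are my own. It reads the file in binary mode so that nothing is translated, then prints each byte as two hex digits.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file exactly as stored on disk.
std::string read_raw(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    assert(in.is_open());
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: format every byte as two uppercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& data) {
    std::string out;
    char buf[4];
    for (unsigned char c : data) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

For example, `hex_bytes("a\r\n")` gives "61 0D 0A", while `hex_bytes("a\n")` gives "61 0A". Running `hex_bytes(read_raw("test.txt"))` on your own file shows which of the two conventions it actually contains.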

As far as I know, fstream assumes its content is ASCII. This might have changed with C++17. I think there were plans to support various character encodings in streams. Those on the cutting edge might be able to comment.

Your operating system has a default character encoding set somewhere in its configuration. I know older Windows machines used Windows-1252. I am not sure what Windows 10 uses. I think most *nix systems use UTF-8. At any rate, you will want to consult your operating system's configuration.

C++ streams are going to want to transform from one to the other when you read and write to file. The transformation of text to its byte representation is a big part of what streams are trying to do for you.

If you don't want the byte representation that the stream is going to provide, then you can feel free to write bytes yourself, however you wish, in binary mode. However, be mindful of how that affects other readers of the file and what encoding they are expecting.

So, keep in mind who created the file, what it looks like as text, and what its binary representation is, both in the file and in memory, and code for it appropriately.

Lucky for us, some encodings also contain the entire ASCII character set, and simply expand on it. UTF-8 is one such encoding that does this.

You can refer to What's the difference between \n and \r\n? for a discussion on that topic.

You can also refer to Difference between files written in binary and text mode

"Standard C++ IOStreams and Locales: Advanced Programmer's Guide and Reference Book by Angelika Langer and Klaus Kreft" is a good book if you want to really get to know your streams inside and out.

Thank you, very informative. But I still do not understand: if I insert "abc" into a .txt file, after reading "a" tellg will give 1, after "b" 2 and after "c" 3. But if I insert "abc\n" or "abc" << endl, after reading "a" tellg will give 2, after "b" 3 and after "c" 4, and lastly 5 after the new line. – kevin kangaji, Oct 11 '18 at 12:49
