Are newline characters counted twice?

Question

I have written a C program to count the words, characters and lines in a text file. The program counts lines and words correctly, but the total character count is wrong.

I am using Git Bash on Windows, so I used the wc command to check my program's output. wc always reports x more characters than my program does, where x is the number of newline characters my program counts.

Here is my program:

#include <stdio.h>

#define IN 1 // getc is currently inside a word
#define OUT 0 // getc has left the word and is reading whitespace

int main(void)
{
    FILE *fp = fopen("lorum ipsum.txt","r");
    if(fp == NULL) // fopen fails if the file is missing or unreadable
    {
        perror("fopen");
        return 1;
    }
    int lineCount = 0;
    int wordCount = 0;
    int charCount = 0;
    int c;
    int position = IN; // whether getc is inside a word (IN) or between words (OUT)

    while((c=getc(fp)) != EOF)
    {
        if(c == '\n')
        {
            lineCount++;
        }
        if(c == '\n' || c == '\t' || c==' ')
        {
            if(position == IN) // means just finished reading the word
            {
                wordCount++;
                position = OUT; // is now reading the white spaces  
            }
        }
        else if(position == OUT)
        {
            //puts("This position is reached");
            position = IN; //currently reading the word
        }

        charCount++;
    }

    // printing to output
    return 0;
}

The rest of the code does not matter here; what matters is that I increment the charCount variable for every character getc reads in the while loop.

Also, I checked for the '\n' character size by using sizeof(), it is just a simple character and occupies 1 byte; so we should count it as one.

From the file size I can also tell that wc is reporting the correct result. So what is the problem? Is there an issue with the encoding in which my text file is stored?

NOTE: Every time I add a newline to my text file by pressing ENTER, the file size increases by two, and so does the character count reported by wc, but my program's character count increases by only one.

EDIT: Thanks to the answers, I now understand that there is an extra \r character at each line ending. When the file is opened in r mode, each line ending is read as a single \n; only in binary mode (rb) does the actual \r\n show up. This answer explains the behavior: what's the differences between r and rb in fopen

Try `fopen("lorum ipsum.txt","rb");`. By default windows line endings: `\r\n` are converted to unix: `\n`. — Piotr Praszmo, May 25 '14 at 10:23
"i checked for the '\n' character size" -- smart, but not smart enough. Newline type differences lie not in the *size* of this character but in its interpretation inside library functions. — Jongware, May 25 '14 at 10:36
@Jongware, all newlines character will be treated as the newlines formatter as per the ASCII standards;i.e i guess you are saying about interpretation inside library functions — darxtrix, May 25 '14 at 10:40
@Banthar, so i tried your solution. It worked but why r or rt modes are not working. — darxtrix, May 25 '14 at 10:46
If you want to count `\r` characters, you need `"rb"` mode under Windows because otherwise they are filtered out in your program. Of course, that does not happen in showing the total file size in a DIR listing. — Jongware, May 25 '14 at 11:02

score 3 · Accepted Answer · edited May 23 '17 at 12:11

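A Windows newline consists of two characters: \r (carriage return) followed by \n (line feed). Your program only ever sees the \n: in text mode ("r") the C runtime on Windows translates each \r\n pair into a single \n, so the \r is never counted, whereas wc counts both bytes.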

See What is the difference between \r and \n? for more details.

answered May 25 '14 at 10:23

Vladimir Kocjancic

What a newline consists of is definitely OS-dependent. – alk, May 25 '14 at 10:27
This is not the case if the file is opened in text mode. – The Paramagnetic Croissant, May 25 '14 at 10:34
OK, that's helpful, but what should I do to read the file in the proper format? "r" mode is not reading it correctly. – darxtrix, May 25 '14 at 10:45

phuclv · Answer 2 · 2019-07-12T04:15:45.070

There are several conventions for ending a line. macOS and Linux currently use a single byte (LF), while Windows uses the pair CR-LF, a convention that dates back to CP/M and was carried over to DOS.

When you open a file in text mode, the C runtime library automatically converts the system's line-ending sequence ('\r\n' in this case) to '\n', so your program counts it only once. For example, on classic Mac OS, where the newline character was '\r', reading a file in text mode would produce '\n'. When writing to a text-mode stream (with printf, for example), the reverse happens: '\n' is converted to the system's line-ending sequence.

In practice you should generally open files in text mode, except when you want to handle line endings yourself (for example, when you need to read files with various line-ending formats on a single platform). Text mode counts the lines correctly. To count bytes, however, you need to open the file in binary mode. But why go to that trouble when you can simply get the file size directly, without any counting?
