
I have written a C program to count the words, characters, and lines in a text file. The program counts lines and words correctly but does not count the total number of characters correctly.

I am using Git Bash on Windows, so I used the wc command to check my program's output. wc always reports x more characters than my program does, where x is the number of newline characters my program counts.

Here is my program:

#include <stdio.h>

#define IN 1 // if getc is reading the word
#define OUT 0 // if getc has read the word and now reading the spaces

int main()
{
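    FILE *fp = fopen("lorum ipsum.txt","r");
    if (fp == NULL) // stop early if the file could not be opened
    {
        perror("lorum ipsum.txt");
        return 1;
    }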
    int lineCount = 0;
    int wordCount = 0;
    int charCount = 0;
    int c;
    int position = IN; //tells about the reading position of getc whether reading the word or has read the word

    while((c=getc(fp)) != EOF)
    {
        if(c == '\n')
        {
            lineCount++;
        }
        if(c == '\n' || c == '\t' || c==' ')
        {
            if(position == IN) // means just finished reading the word
            {
                wordCount++;
                position = OUT; // is now reading the white spaces  
            }
        }
        else if(position == OUT)
        {
            //puts("This position is reached");
            position = IN; //currently reading the word
        }

        charCount++;
    }

    // printing to output
    printf("%d %d %d\n", lineCount, wordCount, charCount);
    fclose(fp);
    return 0;
}

The rest of the code does not matter here; what matters is that I increment the charCount variable for every character read by getc in the while loop.

Also, I checked the size of the '\n' character using sizeof(); it is a single character occupying 1 byte, so it should be counted as one.

Also, from the file size I found that wc is reporting the correct result. So what is the problem? Is there an issue with the encoding in which my text file is stored?

NOTE: Every time I add a newline to my text file by pressing ENTER, the file size increases by two, and so does the character count reported by wc, but my program's character count increases by only one.

EDIT: From the answers I understand that there is an extra \r character at each newline. When the file is opened in r mode, each line ending is read as a single \n; only in binary mode (rb) does the actual \r\n show up. This behavior is explained in: what's the differences between r and rb in fopen

phuclv
darxtrix

2 Answers


A Windows newline consists of two characters: \r (carriage return) and \n (line feed). By checking only for \n, you miss the \r character.

See What is the difference between \r and \n? for more details.

Community
Vladimir Kocjancic

There are many ways to end a line. macOS and Linux currently use a single byte (LF), whereas Windows uses the CR-LF pair, a convention that goes back to CP/M and was carried over to DOS.

When you open a file in text mode, the C runtime library automatically converts the system's line-ending sequence ('\r\n' in this case) to '\n', so each line ending is counted only once. For example, on classic Mac OS, where the newline character is '\r', reading a file in text mode produces '\n'. The reverse happens on output: when printing with printf and similar functions, '\n' is converted to the system's line-ending sequence.
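To see the translation for yourself, here is a minimal sketch that writes two '\n'-terminated lines in text mode and then counts the bytes that actually ended up on disk by re-reading the file in binary mode. The function name and the file path are only illustrative. I would expect 6 on Windows (each '\n' stored as '\r\n') and 4 on Linux or macOS, but check it on your own system.

```c
#include <stdio.h>

/* Writes "a\nb\n" in text mode, then re-reads the file in binary mode and
   returns the number of bytes stored on disk, or -1 if an open fails.
   The name of this helper is hypothetical, chosen only for the example. */
long bytes_on_disk_after_text_write(const char *path)
{
    FILE *out = fopen(path, "w");   /* text mode: '\n' may be translated */
    if (out == NULL)
        return -1;
    fputs("a\nb\n", out);
    fclose(out);

    FILE *in = fopen(path, "rb");   /* binary mode: bytes are read untouched */
    if (in == NULL)
        return -1;

    long count = 0;
    while (getc(in) != EOF)
        count++;
    fclose(in);
    return count;
}
```

Calling bytes_on_disk_after_text_write("demo.txt") and printing the result with "%ld" is enough to compare platforms.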

In practice you should generally open files in text mode, unless you want to handle line endings yourself (for example, when you need to read files with various line-ending formats on a single platform). Text mode counts the number of lines correctly, but to count the number of bytes you need to open the file in binary mode. Then again, why go to that trouble when you can simply get the file size directly, without any counting?
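One possible way to get the size without counting is to seek to the end of a binary stream and ask for the position. This is only a sketch: the helper name is made up for the example, and the C standard does not strictly guarantee that SEEK_END is meaningful on binary streams, although it works on the common desktop platforms. A platform call such as POSIX stat is an alternative.

```c
#include <stdio.h>

/* Returns the size of the file in bytes, or -1 on error.
   Hypothetical helper; relies on fseek(..., SEEK_END) working on a
   binary stream, which is common in practice but not guaranteed by ISO C. */
long file_size_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* binary mode so no bytes are translated */
    if (fp == NULL)
        return -1;

    long size = -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        size = ftell(fp);           /* position at end == number of bytes */

    fclose(fp);
    return size;
}
```

For a file containing the six bytes a, \r, \n, b, \r, \n this should return 6, which is the figure wc -c reports.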


phuclv
  • Good, but (1) the OP wants to count both words *and* total number of characters, so `\r` needs to be read; (2) the text file may not originate from a Windows system, so *always* counting a single `\n` as 2 characters is "over-correcting". – Jongware May 25 '14 at 11:17
  • @Jongware I didn't say to always count a single `\n` as 2 characters; I meant the `'\r\n'` pair. He wants to count the number of characters, not bytes, so he must count the pair as one – phuclv May 25 '14 at 13:23
  • But that is only valid if you are *sure* there are `\r` characters in the input. No way to know *unless* you use `"rb"` (and then you can still count just the `\n` as a single new line, and only add 1 for each `\r` in the "total" character count). – Jongware May 25 '14 at 13:43
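The approach described in the last comment could look roughly like the sketch below: open the file in "rb", treat each '\n' as one line ending, and tally '\r' separately so that both totals are available. The struct and function names are assumptions made for this example, not part of the question's program.

```c
#include <stdio.h>

/* Raw tallies taken from a file opened in binary mode (hypothetical type). */
struct raw_counts {
    long lines;             /* number of '\n' bytes */
    long carriage_returns;  /* number of '\r' bytes */
    long bytes;             /* every byte, including '\r' and '\n' */
};

/* Fills *out and returns 0 on success, or returns -1 if the file cannot
   be opened. No line-ending translation takes place in "rb" mode. */
int count_raw(const char *path, struct raw_counts *out)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    out->lines = 0;
    out->carriage_returns = 0;
    out->bytes = 0;

    int c;
    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            out->lines++;
        else if (c == '\r')
            out->carriage_returns++;
        out->bytes++;
    }

    fclose(fp);
    return 0;
}
```

Here bytes matches what wc -c reports, while bytes minus carriage_returns matches a count that treats each \r\n pair as one character, assuming every '\r' in the file belongs to such a pair. If the file has no '\r' at all, the two numbers are the same, so nothing is over-corrected.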