The following will use lorem.txt as the test file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

I have the following code, which is meant to count the lines, words, and characters in a file (imitating `wc` on Linux):

#include <stdio.h>

int main(){
    char data[500032];  // assigns 500KB of space for input string
    if (fgets(data, sizeof data, stdin)) {
        char *ptr = &data[0];  // initializes pointer at first character
        int count = 0;  // total character count
        int d1_count = 0;  // newline count
        int d23_count = 0;  // ' ' and '\t' count

        while (*ptr){
            char d1 = '\n';
            char d2 = ' ';
            char d3 = '\t';
            count++;  // counts character
            if (*ptr == d1){
                d1_count++; // counts newline
            }
            if (*ptr == d2 || *ptr == d3) {
                d23_count++;  // counts spaces or tabs
            }
            ptr++;  // increments pointer
        }
        printf("%d %d %d\n", d1_count, d23_count+1, count-1);
    }
}

In my Linux terminal, I compile with `gcc -o wordc wordc.c` and then run `./wordc < lorem.txt`.

However, I get 1 69 445 (1 line, 69 words, and 445 characters). This is the number of lines, words, and characters for the first paragraph only. I am expecting 7 lines, 207 words, and 1342 characters.

I assume what is happening is C stops reading the file once it finds a newline. How do I get it to stop doing this?

As an aside, I feel like assigning 500KB of space for a string is a bit hacky and wasteful. Are there any good ways to assign only as much space as I need?

Any help would be appreciated.

Andreas Wenzel
Omaro_IB
  • `fgets` reads a line. If you want to read more lines you'll need a loop. An alternative is to use `fread` to read all or part of the file, process the buffer, and if there's more to read repeat until the end of the file. – Retired Ninja Sep 15 '22 at 18:58
  • *assigning 500KB of space for a string is a bit hacky and wasteful* Indeed it is. *Are there any good ways to assign only as much space as I need?* You can `malloc` and `realloc` to allocate just enough space. If reading from a file, you can use `stat` (or `fseek`/`ftell` + `rewind`) to find the size in advance. You can allocate an array which is hopefully big enough for one line, then read and process a line at a time. Or, at least for this particular problem, you can read the file a character at a time, classifying and counting characters as you go, and never allocate a buffer at all. – Steve Summit Sep 15 '22 at 19:08
  • Or you can use `getline` to read a line at a time, and it will take care of doing the malloc/realloc thing to build a big enough buffer for however long a line it finds. – Steve Summit Sep 15 '22 at 19:10
  • In your question, you state that you want the program to count `7` lines, but you only posted `5` lines of input. Please explain why you want the program to count `7` lines in your posted input. If this is a mistake in your question, then please [edit] the question to fix it. – Andreas Wenzel Sep 16 '22 at 00:22
  • Define **line**, please. There's something mysterious about the text sample you've posted, and I'm "smelling a rat"... Is a **line** defined the appearance of LF, or is it defined by text appearing AFTER a LF? Please add a complete hex listing of some sample text (doesn't have to be large). "Mysterious" ASCII characters and ambiguity are a waste of everyone's time... (cc: @AndreasWenzel) – Fe2O3 Sep 16 '22 at 10:57

2 Answers

Change the line

if (fgets(data, sizeof data, stdin)) {

to

while (fgets(data, sizeof data, stdin)) {

so that you are reading one line per loop iteration.

You will also have to move the lines

int count = 0;  // total character count
int d1_count = 0;  // newline count
int d23_count = 0;  // ' ' and '\t' count

outside the loop, because you want to remember these values between loop iterations.

You will also want to move the line

printf("%d %d %d\n", d1_count, d23_count+1, count-1);

outside the loop if you want to print that line only once, instead of once per loop iteration.

> I feel like assigning 500KB of space for a string is a bit hacky and wasteful. Are there any good ways to assign only as much space as I need?

The buffer only needs to be large enough to store a single line. It does not have to store the entire file at once. Therefore, a significantly smaller buffer would probably be sufficient.

Although it would be possible to use a dynamically allocated buffer (using malloc) and resize the buffer as necessary (using realloc), in this case, it is probably not necessary.

Since you stated in the question that you are using Linux, an alternative would be to use the POSIX-specific function getline, which handles most of the memory management for you.

I have rewritten your program to use getline:

#include <stdio.h>
#include <stdlib.h>

int main() {
    char *data = NULL;
    size_t data_capacity = 0;
    int count = 0;  // total character count
    int d1_count = 0;  // newline count
    int d23_count = 0;  // ' ' and '\t' count

    while ( getline( &data, &data_capacity, stdin ) >= 0 ) {
        char *ptr = &data[0];  // initializes pointer at first character

        while (*ptr){
            char d1 = '\n';
            char d2 = ' ';
            char d3 = '\t';
            count++;  // counts character
            if (*ptr == d1){
                d1_count++; // counts newline
            }
            if (*ptr == d2 || *ptr == d3) {
                d23_count++;  // counts spaces or tabs
            }
            ptr++;  // increments pointer
        }
    }

    free( data );

    printf("%d %d %d\n", d1_count, d23_count+1, count-1);
}

With the input specified in the question, this program has the following output:

5 205 1339

This output is not quite correct, because you are counting the number of spaces in your program, not the number of words. You seem to be attempting to compensate for this by adding 1 to the number of spaces when printing that value. However, this is not sufficient. The exact solution depends on several factors, for example how you want to handle words that are split by a hyphen and a newline character, i.e. whether you want to count such words as one word or two words. However, since this is not the problem that you stated in the question, I will not address that issue.

Andreas Wenzel
  • +1 for `getline()`. One caveat to using `fgets()` is that it will _only_ read (n-1) characters into the buffer, which may not be an entire line, and may bisect a word. If using `fgets()` make sure to check for leading and terminating whitespace (spaces, newlines, etc) to make sure one word wasn’t split between invocations. – Dúthomhas Sep 15 '22 at 19:41
  • @Andreas... Imagine being "cautioned" about "target values" being "split across buffer loads." I suggest responding with the link to that challenge to find "foo" in (was it?) "/sys/file"?? Get some mileage out of all that work `:-)` – Fe2O3 Sep 15 '22 at 23:12
  • @Fe2O3: I guess that you are referring to [this answer of mine](https://stackoverflow.com/a/73431557/12149471)? – Andreas Wenzel Sep 16 '22 at 00:08
  • You got the acceptance... Strange to me... 205 / 3 = 68.3333 words per line... Have you an explanation for this?? (PS: Yes, that one... I still remember the fun of "several revisions" until I had something that finally passed the testing you did (and I didn't do) `:-)` Those were the days! `:-)` – Fe2O3 Sep 16 '22 at 00:09
  • @Fe2O3: OP's code is counting the number of spaces and then adding one to that before printing it. So the actual number of spaces is `204`, which is divisible by `3`. I have added a remark at the bottom of my answer to mention that OP's code is wrong in this respect. – Andreas Wenzel Sep 16 '22 at 00:57
  • The data contains an _irregular_ number of invisible SPs (1,1,0), and the approach is flawed if the source were to have irregular spacing... fwiw, I like my approach better... All good `:-)` (Thanks for the "black ink" tune-up... I just learned something new: how to achieve that `:-)` thanks!) – Fe2O3 Sep 16 '22 at 01:02
EDIT: "Illusions". I misinterpreted the highlighting that appeared when I selected the sample strings in the OP to copy/paste them and work with that data. The highlight extended further at the end of lines 1 & 3 than on line 5, and I took that to mean an 'invisible SP' (or other control character) on those 2 lines. Back-to-front! The longer highlight showed the LF at the end of lines 1 & 3; the shorter highlight on line 5 merely showed the absence of a LF at the end of that line. Wrong interpretation on my part.

This answer has been revised with this new insight.

It's tough to say what goes on when there are so many variables all interacting with one another.

Counting whitespace characters and fiddling their values is not the way to determine what would be recognised as a "word" (perhaps with a full stop attached).

Here's a version that is more trustworthy. It doesn't muck about reading from a file. To use "external data" instead, one could replace the indexing into this long string with c = fgetc( stdin ); (and use ungetc() in the "read ahead" groping for the end of a word.)

#include <stdio.h>
#include <ctype.h>

char *in =
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
"Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
"\n"
"\n"
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
"Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
"\n"
"\n"
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
"Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
;

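int main() {
    int lcnt = 0, wcnt = 0, ccnt = 0;

    for( int i = 0; in[i]; i++ ) {
        ccnt++;
        if( in[i] == '\n' )
            lcnt++;
        // cast: isspace() requires an unsigned char value (or EOF)
        if( !isspace( (unsigned char)in[i] ) ) {
            wcnt++;  // first character of a word
            // skip the rest of the word, still counting its characters
            while( in[i+1] && !isspace( (unsigned char)in[i+1] ) ) ccnt++, i++;
        }
    }

    printf( "%d lines, %d words, %d chars\n", lcnt, wcnt, ccnt );
    return 0;
}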

And this is the output

4 lines, 207 words, 1339 chars

EDIT: Obviously, counting LF is insufficient. Better code coming soon.

// 207 / 3 = 69 words per populated line.

//   445x? + 1xLF
// +   1xLF
// + 445x? + 1xLF
// +   1xLF
// + 445x?
// = 1339

EDIT2: Coming back to this, it seems the answer could do with improvement.

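Dealing with the correct copy/paste of the OP data, the following gives the expected results. It accounts for the missing LF on line 5, and the loop body contains no if statements.

f is a flag that is set when the current character is not whitespace, and b holds the value f had for the previous character (i.e. whether that character was "intraword"). A word is counted at each transition where b is clear and f is set, which is the first character of a word.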

Invocations of function isspace() could be replaced with c==' ' || c=='\t'... for more speed, if desired, and the ++ moved to the for() loop's 3rd part.

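// uses the same #includes and the string `in` as the listing above
int main() {
    int lcnt = 0, wcnt = 0, ccnt = 0;

    for( int f, b = 0; in[ ccnt ]; b = f ) {
        lcnt += in[ ccnt ] == '\n';
        f = !isspace( (unsigned char)in[ ccnt++ ] );  // cast avoids UB for negative char values
        wcnt += !b && f;  // count a word at each space -> non-space transition
    }
    // humans don't notice if final LF present or not
    // (ccnt > 0 guards against reading in[-1] when the input is empty)
    lcnt += ccnt > 0 && in[ ccnt - 1 ] != '\n';

    printf( "%d lines, %d words, %d chars\n", lcnt, wcnt, ccnt );
    return 0;
}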
Fe2O3
  • In your answer, you wrote: "there appears to be an invisible SP at the end of the first two lines" -- [I am unable to verify this, when looking at the ASCII codes, even when I take the input directly from the HTML source.](https://godbolt.org/z/x7E44cTzr) Can you find the exact offset that you are talking about? Offset `0x1BD` looks pretty normal to me. – Andreas Wenzel Sep 16 '22 at 01:15
  • @AndreasWenzel To get the OP data, I "selected" that text in the OP and noticed the 1st two "blah-blah" lines "highlighted" an invisible character at the end of those two lines; not so for the 3rd "blah-blah" line... Just sweep your mouse across the sample data... Going to original source may indicate the OP is being devious and hiding a trap for the unweary... :-) (It may not be SP, but there's something else there on those two lines.) – Fe2O3 Sep 16 '22 at 01:19
  • I am unable to find that "invisible character" you are talking about. I am using Firefox. Can you please post your input into the godbolt link that I sent you and see if you can see the ASCII code of that character. If so, please send me the updated link and tell me the offset, so I can take a look at it. – Andreas Wenzel Sep 16 '22 at 01:21
  • Well, it looks like I was jumping to conclusions about SP. I highlighted/copied the text from the OP (still with longer highlighting on lines 1 & 3 than on line 5). Then I pasted it into my text editor and saved the file as Unix (LF only). The file is 1339 bytes without the usual LF at the very end, 1340 bytes when there are 5 LFs and 1343 when "pasted and saved as DOS file" (CR/LF)... My guess now? There are a couple of CR's hiding in the text posted above, showing up when "highlighted"... How they are dealt with 'downstream' depends on 'downstream'. Enough of this, I think... – Fe2O3 Sep 16 '22 at 01:43
  • When I download this Stack Overflow web page bypassing my web browser, using [`curl`](https://en.wikipedia.org/wiki/CURL) instead, the HTML source of OP's input contains `\n\n` after the first line ending of OP's input. It is not `\r\n\n`, as implied in your posted source code. There is also no byte with an ASCII code which would explain your "invisible character". Therefore, I guess that this is an issue with your web browser. Which browser are you using? Can you provide a [hex dump](https://en.wikipedia.org/wiki/Hex_dump) of OP's input that you get after using copy&paste? – Andreas Wenzel Sep 16 '22 at 19:28
  • @AndreasWenzel Oops! Edit to my answer tries to explain things (as I saw them). I'll soon add a 'better' version... Sorry for the commotion... I misinterpreted what the UI was showing me. – Fe2O3 Sep 16 '22 at 23:39