0

I'm having some troubles using strtok function. As an exercise I have to deal with a text file by ruling out white spaces, transforming initials into capital letters and printing no more than 20 characters in a line.

Here is a fragment of my code:

fgets(sentence, SIZE, f1_ptr);
    char *tok_ptr = strtok(sentence, " \n"); //tokenazing each line read
    tok_ptr[0] = toupper(tok_ptr[0]); //initials to capital letters

    int num = 0, i;

    while (!feof(f1_ptr)) {
        while (tok_ptr != NULL) {
            for (i = num; i < strlen(tok_ptr) + num; i++) {
                if (i % 20 == 0 && i != 0) //maximum of 20 char per line
                    fputc('\n', stdout);
                fputc(tok_ptr[i - num], stdout);
            }

            num = i;

            tok_ptr = strtok(NULL, " \n");
            if (tok_ptr != NULL)
                tok_ptr[0] = toupper(tok_ptr[0]);
        }

        fgets(sentence, SIZE + 1, f1_ptr);
        tok_ptr = strtok(sentence, " \n");
        if (tok_ptr != NULL)
            tok_ptr[0] = toupper(tok_ptr[0]);
    }

The text is just a bunch of lines I just show as a reference:

Watch your thoughts ; they become words .
Watch your words ; they become actions .
Watch your actions ; they become habits .
Watch your habits ; they become character .
Watch your character ; it becomes your destiny .

Here is what I obtain in the end:

WatchYourThoughts;Th
eyBecomeWords.WatchY
ourWords;THeyBecomeA
ctions.WatchYourActi
ons;TheyBecomeHabits
.WatchYourHabits;The
yBecomeCharacteR.Wat
chYourCharacter;ItBe
comesYourDEstiny.Lao
-Tze

The final result is mostly correct, but sometimes (for example "they" in they become (and only in that case) or "destiny") words are not correctly tokenized. So for example "they" is split into "t" and "hey" resulting in THey (DEstiny in the other instance) after the manipulations I made. Is it some bug or am I missing something? Probably my code is not that efficient and some condition may end up being critical...

Thank you for the help, it's not that big of a deal, I just don't understand why such a behaviour is occurring.

AleCab
  • 3
  • 2
  • 1
    It would be worth stepping through in the debugger to see what's happening since you have a pretty simple failure case. – Retired Ninja Oct 30 '20 at 19:30
  • 5
    Apropos: [**Why is “while ( !feof (file) )” always wrong?**](https://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) – Andrew Henle Oct 30 '20 at 19:40
  • 1
    There are multiple red flags: 1. You ignore the return-value of `fgets()`. Your outer loop should probably be `while (fgets(....)) {` 2. You don't handle a line being too long to retrieve at once. Why did you want to read lines instead of blocks anyway? 3. The argument to `toupper()` must be in the range of `unsigned char`. Casting it is the way to go there. 4. `strtok()` is not thread-safe. Do it manually unless you must. 5. When I see `pointer != NULL` I always wonder why nobody added a string of `!= false` or something similarly inane there. – Deduplicator Oct 30 '20 at 19:42

2 Answers2

1

You have a large number of errors in your code and you are over-complicating the problem. The most pressing error is Why is while ( !feof (file) ) always wrong? Why? Trace the execution-path within your loop. You attempt to read with fgets(), and then you use sentence without knowing whether EOF was reached calling tok_ptr = strtok(sentence, " \n"); before you ever get around to checking feof(f1_ptr)

What happens when you actually reach EOF? That IS "Why while ( !feof (file) ) is always wrong?" Instead, you always want to control your read-loop with the return of the read function you are using, e.g. while (fgets(sentence, SIZE, f1_ptr) != NULL)

What is it you actually need your code to do?

The larger question is why are you over-complicating the problem with strtok, and arrays (and fgets() for that matter)? Think about what you need to do:

  1. read each character in the file,
  2. if it is whitespace, ignore it, set the in-word flag false,
  3. if a non-whitespace, if 1st char in word, capitalize it, output the char, set the in-word flag true and increment the number of chars output to the current line, and finally
  4. if it is the 20th character output, output a newline and reset the counter zero.

The bare-minimum tools you need from your C-toolbox are fgetc(), isspace() and toupper() from ctype.h, a counter for the number of characters output, and a flag to know if the character is the first non-whitespace character after a whitespace.

Implementing the logic

That makes the problem very simple. Read a character, is it whitespace?, set your in-word flag false, otherwise if your in-word flag is false, capitalize it, output the character, set your in-word flag true, increment your word count. Last thing you need to do is check if your character-count has reached the limit, if so output a '\n' and reset your character-count zero. Repeat until you run out of characters.

You can turn that into a code with something similar to the following:

#include <stdio.h>
#include <ctype.h>

#define CPL 20      /* chars per-line, if you need a constant, #define one (or more) */

int main (int argc, char **argv) {
    
    int c, in = 0, n = 0;   /* char, in-word flag, no. of chars output in line */
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
    
    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    while ((c = fgetc(fp)) != EOF) {            /* read / validate each char in file */
        if (isspace(c))                         /* char is whitespace? */
            in = 0;                             /* set in-word flag false */
        else {  /* otherwise, not whitespace */
            putchar (in ? c : toupper(c));      /* output char, capitalize 1st in word */
            in = 1;                             /* set in-word flag true */
            n++;                                /* increment character count */
        }
        if (n == CPL) {                         /* CPL limit reached? */
            putchar ('\n');                     /* output newline */
            n = 0;                              /* reset cpl counter */
        }
    }
    putchar ('\n');     /* tidy up with newline */
    
    if (fp != stdin)    /* close file if not stdin */
        fclose (fp);
}

Example Use/Output

Given your input file stored on my computer in dat/text220.txt, you can produce the output you are looking for with:

$ ./bin/text220 dat/text220.txt
WatchYourThoughts;Th
eyBecomeWords.WatchY
ourWords;TheyBecomeA
ctions.WatchYourActi
ons;TheyBecomeHabits
.WatchYourHabits;The
yBecomeCharacter.Wat
chYourCharacter;ItBe
comesYourDestiny.

(the executable for the code was compiled to bin/text220, I usually keep separate dat, obj, and bin directories for data, object files and executables to keep by source code directory clean)

note: by reading from stdin by default if no filename is provided as the first argument to the program, you can use your program to read input directly, e.g.

$ echo "my dog      has   fleas  -   bummer!" | ./bin/text220
MyDogHasFleas-Bummer
!

No fancy string functions required, just a loop, a character, a flag and a counter -- the rest is just arithmetic. It's always worth trying to boils your programming problems down to basic steps and then look around your C-toolbox and find the right tool for each basic step.

Using strtok

Don't get me wrong, there is nothing wrong with using strtok and it makes a fairly simple solution in this case -- the point I was making is that for simple character-oriented string-processing, it's often just a simple to loop over the characters in the line. You don't gain any efficiencies using fgets() with an array and strtok(), the read from the file is already placed into a buffer of BUFSIZ1.

If you did want to use strtok(), you should control you read-loop your with the return from fgets()and then you can tokenize with strtok() also checking its return at each point. A read-loop with fgets() and a tokenization loop with strtok(). Then you handle first-character capitalization and then limiting your output to 20-chars per-line.

You could do something like the following:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define CPL 20      /* chars per-line, if you need a constant, #define one (or more) */
#define MAXC 1024
#define DELIM " \t\r\n"

void putcharCPL (int c, int *n)
{
    if (*n == CPL) {            /* if n == limit */
        putchar ('\n');         /* output '\n' */
        *n = 0;                 /* reset value at mem address 0 */
    }
    putchar (c);                /* output character */
    (*n)++;                     /* increment value at mem address */
}

int main (int argc, char **argv) {
    
    char line[MAXC];    /* buffer to hold each line */
    int n = 0;          /* no. of chars ouput in line */
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
    
    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    while (fgets (line, MAXC, fp))  /* read each line and tokenize line */
        for (char *tok = strtok (line, DELIM); tok; tok = strtok (NULL, DELIM)) {
            putcharCPL (toupper(*tok), &n);     /* convert 1st char to upper */
            for (int i = 1; tok[i]; i++)        /* output rest unchanged */
                putcharCPL (tok[i], &n);
        }
    putchar ('\n');     /* tidy up with newline */
    
    if (fp != stdin)    /* close file if not stdin */
        fclose (fp);
}

(same output)

The putcharCPL() function is just a helper that checks if 20 characters have been output and if so outputs a '\n' and resets the counter. It then outputs the current character and increments the counter by one. A pointer to the counter is passed so it can be updated within the function making the updated value available back in main().

Look things over and let me know if you have further questions.

footnotes:

1. Depending on your version of gcc, the constant in the source setting the read-buffer size may be _IO_BUFSIZ. _IO_BUFSIZ was changed to BUFSIZ here: glibc commit 9964a14579e5eef9 For Linux BUFSIZE is defined as 8192 (512 on Windows).

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
  • The first solution you propose is very simple indeed, I also thought of that one myself and it is actually kind of the same provided by the book I'm using. My intention using strtok was actually to experiment a command I didn't know and that I thought could come in useful when dealing with string manipulations. As my code makes it pretty evident, I'm quite a beginner, I'm not that comfortable using pointers yet. – AleCab Oct 31 '20 at 08:21
  • That is a good way to learn. Take any problem and program two (or more) independent solutions that solve the problem -- which gives you a deeper understanding of each of the solutions. If you are struggling with pointers, then the following may help [Difference between char *pp and (char*) p?](https://stackoverflow.com/a/60519053/3422102) and [Pointer to pointer of structs indexing out of bounds(?)...](https://stackoverflow.com/a/60639540/3422102) (ignore the titles to the questions -- the answers contain a step-by-step introduction to pointers) – David C. Rankin Oct 31 '20 at 08:26
  • I pressed enter by error earlier, just wanted to add a couple of things. I actually thought that !feof condition was better than fgets != NULL. The two were presented to me as equally good alternatives, I thought feof was best for clarity purposes. I'll pay more attention to it in the future. Thank you for the exhaustive reply. – AleCab Oct 31 '20 at 08:41
  • You generally use `feof()` after an *unformatted-read* such as with `fread()` or the `read()` syscall. For *line-oriented* input such as with `fgets()` or POSIX `getline()` the return from each of those function indicate success or failure - so they provide a direct indication that can be used to control the read-loop. Good luck with your coding! Also from the links above you know there is nothing special about a pointer. It is simply a variable that holds the address of another object as its value. Most string functions do return pointers -- so understand you get the address to the string. – David C. Rankin Oct 31 '20 at 08:45
0

This is actually a much more interesting OP from a professional point of view than some of the comments may suggest, despite the 'newcomer' aspect of the question, which may sometimes raise fairly deep, underestimated issues.
The fun thing is that on my platform (W10, MSYS2, gcc v.10.2), your code runs fine with correct results:

WatchYourThoughts;Th
eyBecomeWords.WatchY
ourWords;TheyBecomeA
ctions.WatchYourActi
ons;TheyBecomeHabits
.WatchYourHabits;The
yBecomeCharacter.Wat
chYourCharacter;ItBe
comesYourDestiny.

So first, congratulations, newcomer: your coding is not that bad.
This points to how different compilers may or may not protect against limited inappropriate coding or specification misuse, may or may not protect stacks or heaps.
This said, the comment by @Andrew Henle pointing to an illuminating answer about feof is quite relevant.
If you follow it and retrieve your feof test, just moving it down after read checks, not before (as below). Your code should yield better results (note: I will just alter your code minimally, deliberately ignoring lesser issues):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <ctype.h>
#define SIZE 100 // add some leeway to avoid off-by-one issues

int main()
{
    FILE* f1_ptr = fopen("C:\\Users\\Public\\Dev\\test_strtok", "r");
    if (! f1_ptr)
    {
             perror("Open issue");
             exit(EXIT_FAILURE);
    }

char sentence[SIZE] = {0};

if (NULL == fgets(sentence, SIZE, f1_ptr))
{
     perror("fgets issue"); // implementation-dependent
     exit(EXIT_FAILURE);
}
errno = 0;
char *tok_ptr = strtok(sentence, " \n"); //tokenizing each line read

if (tok_ptr == NULL || errno)
{
     perror("first strtok parse issue");
     exit(EXIT_FAILURE);
}

tok_ptr[0] = toupper(tok_ptr[0]); //initials to capital letters

int num = 0;
size_t i = 0;

while (1) {
    while (1) {
        for (i = num; i < strlen(tok_ptr) + num; i++) {
            if (i % 20 == 0 && i != 0) //maximum of 20 char per line
                fputc('\n', stdout);
            fputc(tok_ptr[i - num], stdout);
        }

        num = i;

        tok_ptr = strtok(NULL, " \n");
        if (tok_ptr == NULL) break;
        tok_ptr[0] = toupper(tok_ptr[0]);
    }

    if (NULL == fgets(sentence, SIZE, f1_ptr)) // let's get away whith annoying +1,
                                               // we have enough headroom
    {
       if (feof(f1_ptr))
       {
           fprintf(stderr, "\n%s\n", "Found EOF");
           break;
       }
       else
       {
           perror("Unexpected fgets issue in loop"); // implementation-dependent
           exit(EXIT_FAILURE);
       }
    }

    errno = 0;
    tok_ptr = strtok(sentence, " \n");
    if (tok_ptr == NULL)
    {
        if (errno)
        {
          perror("strtok issue in loop");
          exit(EXIT_FAILURE);
        }

        break;
    }

    tok_ptr[0] = toupper(tok_ptr[0]);
}

return 0;

}

$ ./test

WatchYourThoughts;Th
eyBecomeWords.WatchY
ourWords;TheyBecomeA
ctions.WatchYourActi
ons;TheyBecomeHabits
.WatchYourHabits;The
yBecomeCharacter.Wat
chYourCharacter;ItBe
comesYourDestiny.
Found EOF
  • Couple of notes, you may want to make your outer loop simply `while (fgets(sentence, SIZE, f1_ptr))` then you can tokenize and output each char in the first token. (you can also check the last char in `sentence` is `'\n'` to validate you read the whole line) That leaves a single inner loop of `while (tok_ptr = strtok(NULL, " \n"))`. (note when `fgets()` fails, `errno` isn't set -- making the call to `perror("fgets issue");` a bit out of place `:)` – David C. Rankin Oct 31 '20 at 03:55
  • With due respect: 1) `strtok` is not a "fancy" string function, it is standard-complying (SVr4, POSIX.1-2001, BSD 4.3, C89, C99.) . OK, not to be used in multithreading environments but not banned or deprecated either. 2) Simplifying/beautifying the OP's algorithm is not the issue. Nor replacing `strtok` with simpler library calls. As I wrote, the code was minimally patched. 3) You are partly right about `errno` not being set by `fgets`, and partly wrong: this is implementation-dependent (standards say `perror` displays errors of system *and library* calls, and `fgets` is a C library call). –  Oct 31 '20 at 06:28
  • No no, don't get me wrong, I'm not saying you can't use `strtok` and I'm not dinging you for it. I'm simply saying you can order your loop in a much cleaner way to make the process simpler. I'll drop an example. – David C. Rankin Oct 31 '20 at 06:31
  • I do also hope you noticed the smiley face `:)` after the note about `fgets()` and `errno`, it wouldn't hurt anything if it wasn't implemented other than likely outputting `"fgets issue: SUCCESS"` in that case -- which is what prompted the out of place comment. The standard is silent on leaving `errno` to the implementation [7.21.7.2 The fgets function](http://port70.net/~nsz/c/c11/n1570.html#7.21.7.2) Some compilers (one major one in particular) implement a number of non-standard and non-POSIX features. (such as `fflush(stdin)` for all use cases) – David C. Rankin Oct 31 '20 at 07:09
  • Whether `perror` is outputting `SUCCESS` or no after a NULL return from `fgets` is informative for debugging purposes. A system running out of memory, or a file being inaccessible because another process has blocked read access in some way that did not block `fopen`, may not allow buffer memory allocation and/or cause `fgets` to crash. In this case and others, some C library implementations will throw out an error message, which may be useful. –  Oct 31 '20 at 07:52
  • No arguments there. You have done a very thorough job in error checking and diagnosis of any `fgets()` failure. Distinguishing `EOF` and some other read error can be informative. Make you a deal. Clean up the indentation from `char sentence[SIZE] = {0};` on down to make it consistent (either format by indenting all by 4-spaces or by wrapping your normal indentation with `\`\`\`c` at top and `\`\`\`` at the bottom so everything formats as code) -- and I'll happily vote for it. – David C. Rankin Oct 31 '20 at 07:57
  • I really do not understand that comment and I certainly didn't mean to be condescending. Offer stands. Always remember, you will catch many more flies in life with honey than you will with salt. – David C. Rankin Nov 01 '20 at 08:27