
I'm trying to get the text from a file and split it into words by removing spaces and other symbols. This is part of my code for handling the file:

void addtext(char wordarray[M][N])
{
    FILE *fileinput;
    char word[N];
    char filename[N];
    char *pch;
    int i=0;
    printf("Input the name of the text file(ex \"test.txt\"):\n");
    scanf("%19s", filename);
    if ((fileinput = fopen(filename, "rt"))==NULL)
    {
        printf("cannot open file\n");
        exit(EXIT_FAILURE);
    }
    fflush(stdin);
    while (fgets(word, N, fileinput)!= NULL)
    {
       pch = strtok (word," ,'.`:-?");
       while (pch != NULL)
       {
         strcpy(wordarray[i++], pch);
         pch = strtok (NULL, " ,'.`:-?");
       }

    }
    fclose(fileinput);
    wordarray[i][0]='\0';
    return ;
}

But here is the issue. When the text input from the file is:

Alice was beginning to get very tired of sitting by her sister on the bank.

Then the output when I try to print it is this:

Alice
was
beginning
to
get
very
tired
of
sitting
by
her
s
ister
on
the
bank

As you can see, the word "sister" is split into 2. This happens quite a few times when adding a bigger text file. What am I missing?

Joe Durner

2 Answers


If you count the characters, you'll see that the s is the 57th character. 57 is 3 times 19, and 19 is the number of characters read in each cycle (20 - 1, because fgets reads at most N - 1 characters and uses the last byte of the buffer for the null terminator).

Because you are reading the line in chunks of 19 characters, it gets cut at every multiple of 19 characters, and the rest is read by the next fgets call in the loop.

The first two cuts do no harm: the first falls after character 19, right at the end of "beginning", and the second after character 38, the space that follows "tired". The third falls in the middle of "sister", so the word is split in two.
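To make the cut positions concrete, here is a minimal sketch that mimics what successive fgets calls hand back for a single line. The helper name nth_chunk is made up for illustration, and the 20-byte buffer (N = 20) is an assumption based on the %19s in the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies into `out` the piece of `line` that the k-th
   fgets call (0-based) would return for a buffer of `bufsize` bytes.
   Assumes `line` contains no newline before its end and that `out` has
   room for at least `bufsize` bytes. Returns the number of chars copied. */
size_t nth_chunk(const char *line, size_t bufsize, size_t k, char *out)
{
    size_t step = bufsize - 1;      /* fgets stores at most bufsize - 1 chars */
    size_t len = strlen(line);
    size_t start = k * step;

    if (start >= len) {             /* nothing left to read */
        out[0] = '\0';
        return 0;
    }

    size_t n = (len - start < step) ? len - start : step;
    memcpy(out, line + start, n);
    out[n] = '\0';
    return n;
}
```

For the sample sentence and a 20-byte buffer, chunks 0 to 3 come out as "Alice was beginning", " to get very tired ", "of sitting by her s" and "ister on the bank.", which matches the split seen in the output.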

Two possible fixes:

  • Replace:

    while (fgets(word, N, fileinput)!= NULL)
    

    With:

    while (fscanf(fileinput, "%19s", word) == 1)
    

    Provided that no word in the file is longer than 19 characters, which is the case here.

  • Make word large enough to hold the whole line:

    char word[80];
    

    80 should be enough for the sample line. The size passed to fgets has to change as well, for example fgets(word, sizeof word, fileinput).
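As a rough sketch of the first fix, the loop below reads whitespace-separated tokens with fscanf and then strips punctuation with strtok. The function name read_words, the WORD_LEN constant, and the max_words parameter are assumptions made for this example; the original addtext has no bounds check on the array.

```c
#include <stdio.h>
#include <string.h>

#define WORD_LEN 20   /* assumed value of N in the question */

/* Hypothetical variant of addtext: reads whitespace-separated tokens from
   `in`, splits them on punctuation, and stores up to `max_words` words.
   Returns the number of words stored. */
int read_words(FILE *in, char words[][WORD_LEN], int max_words)
{
    static const char delims[] = " ,'.`:-?";
    char token[WORD_LEN];
    int count = 0;

    /* %19s never writes more than 19 chars plus the terminator into token */
    while (count < max_words && fscanf(in, "%19s", token) == 1) {
        char *p = strtok(token, delims);
        while (p != NULL && count < max_words) {
            strcpy(words[count++], p);   /* p is at most 19 chars long */
            p = strtok(NULL, delims);
        }
    }
    return count;
}
```

Because %s stops at whitespace rather than at a fixed byte count, a word is only split here if it is longer than 19 characters.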

anastaciu
  • N is indeed 19. Then the problem is the N in fgets as I thought. Can this be fixed or should I try a different approach? – Joe Durner Apr 27 '21 at 09:53
  • @JoeDurner, yes, it's easily fixable, you can simply make `word` be large enough to get the whole line. – anastaciu Apr 27 '21 at 09:54
  • @JoeDurner added 2 possible solutions to the answer. – anastaciu Apr 27 '21 at 09:58
  • Both of those seem to be working, but fscanf does seem to be working better. I wanted to avoid it at first since I had already used it and wanted to experiment. Thanks a bunch. – Joe Durner Apr 27 '21 at 10:04
  • @JoeDurner good mindset, experimenting is a great way of learning, I'll note that as a general rule, `fgets` is better, but in this particular case `fscanf` does seem to be a better option, as always, it's a matter of picking the right tool for the job. – anastaciu Apr 27 '21 at 10:08

What am I missing?

You are missing that a single fgets call reads at most N-1 characters from the file. Consequently, the buffer word may contain only the first part of a word. For instance, it seems that in your case the s from the word sister was read by one fgets call and the remaining part, i.e. ister, was read by the next fgets call. As a result, your code detected sister as two words.

So you need to add code that checks whether the end of the buffer is a whole word or only part of a word.

To start with you can increase N to a higher number but to make it work in general you must add code that checks the end of the word buffer.

Also notice that a long word may require more than two fgets calls.
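One possible shape for such a check, as a sketch only: a word straddles two consecutive fgets buffers when the first one ends in a non-delimiter (and not in a newline) and the next one starts with a non-delimiter. The function name word_spans_chunks and its parameters are invented for this illustration.

```c
#include <string.h>

/* Hypothetical check: returns 1 if the last word of `prev` continues at the
   start of `next`, 0 otherwise. `prev` and `next` are the strings returned
   by two consecutive fgets calls; `delims` lists the delimiter characters. */
int word_spans_chunks(const char *prev, const char *next, const char *delims)
{
    size_t len = strlen(prev);
    if (len == 0 || next[0] == '\0')
        return 0;                          /* nothing to join */

    char last = prev[len - 1];
    char first = next[0];

    if (last == '\n' || first == '\n')
        return 0;                          /* a line break always ends a word */

    /* Both edge characters must be ordinary word characters */
    return strchr(delims, last) == NULL && strchr(delims, first) == NULL;
}
```

When it returns 1, the caller would have to keep the trailing piece of the first buffer and prepend it to the first token of the next one, repeating this for words that span more than two buffers.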

As a simple alternative to fgets and strtok, consider fread and simple char-by-char parsing of the input.

Below is a simple, low-performance example of how it can be done.

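#include <stdio.h>

// Parser states used by addtext
enum { LOOK_FOR_WORD, READING_WORD };

// Returns 1 if c separates words, 0 otherwise
int isdelim(char c)
{
  if (c == '\n') return 1;
  if (c == ' ') return 1;
  if (c == '.') return 1;
  return 0;
}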

void addtext(void)
{
    FILE *fileinput;
    char *filename = "test.txt";

    if ((fileinput = fopen(filename, "rt"))==NULL)
    {
        printf("cannot open file\n");
        return;
    }

    char c;
    int state = LOOK_FOR_WORD;
    while (fread(&c, 1, 1, fileinput) == 1)
    {
      if (state == LOOK_FOR_WORD)
      {
        if (isdelim(c))
        {
          // Nothing to do.. keep looking for next word
        }
        else
        {
          // A new word starts
          putchar(c);
          state = READING_WORD;
        }
      }
      else
      {
        if (isdelim(c))
        {
          // Current word ended
          putchar('\n');
          state = LOOK_FOR_WORD;

        }
        else
        {
          // Current word continues
          putchar(c);
        }
      }
    }

    fclose(fileinput);
    return ;
}

To keep the code simple it prints the words using putchar instead of saving them in an array but that is quite easy to change.
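A minimal sketch of that change, under a few assumptions: the names collect_words, is_delim_char and MAX_LEN are made up here, fgetc is swapped in for fread to keep the loop short, and words longer than MAX_LEN - 1 characters are simply truncated.

```c
#include <stdio.h>

#define MAX_LEN 20   /* assumed maximum word length, including the terminator */

/* Same three delimiters as the example above */
static int is_delim_char(int c)
{
    return c == '\n' || c == ' ' || c == '.';
}

/* Hypothetical variant of the state machine that stores words instead of
   printing them. Returns the number of words stored (at most max_words). */
int collect_words(FILE *in, char words[][MAX_LEN], int max_words)
{
    int count = 0;   /* completed words */
    int len = 0;     /* length of the word being built; 0 = looking for a word */
    int c;

    while (count < max_words && (c = fgetc(in)) != EOF) {
        if (is_delim_char(c)) {
            if (len > 0) {               /* current word ended */
                words[count][len] = '\0';
                count++;
                len = 0;
            }
        } else if (len < MAX_LEN - 1) {
            words[count][len++] = (char)c;   /* word starts or continues */
        }
        /* characters beyond MAX_LEN - 1 are dropped (truncation) */
    }

    if (len > 0 && count < max_words) {  /* input ended in the middle of a word */
        words[count][len] = '\0';
        count++;
    }
    return count;
}
```

The len counter doubles as the state variable: zero corresponds to LOOK_FOR_WORD and anything else to READING_WORD.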

Further, the code only reads one char at a time from the file. Again, it's quite easy to change the code to read bigger chunks from the file.

Likewise, you can add more delimiters to isdelim as you like (and improve the implementation).
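One way to extend the delimiter set without a growing chain of if statements is a lookup with strchr, sketched below. The name is_delim_char and the exact delimiter list (the strtok set from the question plus newline and tab) are assumptions for illustration.

```c
#include <string.h>

/* Hypothetical replacement for the delimiter test: returns 1 if c is one of
   the listed delimiters, 0 otherwise. */
int is_delim_char(char c)
{
    static const char delims[] = " \n\t,'.`:-?";

    /* strchr would also match the terminating '\0', so exclude it explicitly */
    return c != '\0' && strchr(delims, c) != NULL;
}
```

Adding a delimiter then only means adding a character to the delims string.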

Support Ukraine