Remove most common word from string in C

Question

I need to remove all occurrences of most common word from string in C.

If there are several words in the text that are repeated the same number of times, the function should remove the one of the most common words that is closest to the beginning of the string. When omitting words, you should not omit surrounding spaces and other characters. If the received string does not contain any words, the function does not need to do anything.

A word is defined as an array of uppercase and lowercase letters. The function does not need to distinguish between uppercase and lowercase letters

My algorithm is the following:

find how many times the most common word appears in string
then go word by word through string
check if word occurrence is equal to occurrence of most common word
remove found most common word

Code:

#include <stdio.h>
#include <limits.h>
#include <ctype.h>

int number_of_word_occurrence(char *s, char *start, char *end) {
  int number = 0;
  while (*s != '\0') {
    char *p = start;
    char *q = s;
    while (p != end) {
      if (*p != *q)break;
      p++;
      q++;
    }
    if (p == end)number++;
    s++;
  }
  return number;
}

int length(char *s) {
  char *p = s; int number = 0;
  while (*p != '\0') {
    p++;
    number++;
  }
  return number;
}

char *remove_most_common(char *s) {
  int n, max = INT_MIN;
  char *p = s;
  // Find max occurrence
  while (*p != '\0') {
    char *start = p;
    int word_found = 0;
    while (toupper(*p) >= 'A' && toupper(*p) <= 'Z' && *p != '\0') {
      word_found = 1;
      p++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, p);
      if (n > max)max = n;
    }
    p++;
  }
  p = s;
  int len = length(s);
  char *end = s + len;
  int i;
  // Removing most common word
  while (p != end) {
    char *start = p, *pp = p;
    int word_found = 0;
    while (toupper(*pp) >= 'A' && toupper(*pp) <= 'Z' && pp != end) {
      word_found = 1;
      pp++;
    }
    if (word_found) {
      n = number_of_word_occurrence(s, start, pp);
      // If word has max, then remove it
      if (n == max) {
        while (pp != end) {
          *start = *pp;
          start++;
          pp++;
        }
        end -= max; // resize end of string
        len-=max;
      }
    }
    p++;
  } 
  s[len+=2]='\0';
  return s;
}

int main() {
  char s[1000] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
  printf("%s ", remove_most_common(s) );
  return 0;
}

words that should be removed are in bold

EXAMPLE 1: "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop"

OUTPUT: "Koristio sam auto- da dodjem do znaka ali prije stopa sam otvorio dekstop kompjutera "

EXAMPLE 2: " This is string. "

OUTPUT: " is string. "

EXAMPLE 3: "1PsT1 psT2 3Pst pstpst pst";

OUTPUT: " "11 2 3 pstpst "

EXAMPLE 4: "oneeebigggwwooorrrddd";

OUTPUT: ""

Could you help me to fix my code? I have some errors while removing characters. Also, could you help me to remove the word closest to the beginning if all of word occurrences are the same?

Note: Functions from the string.h, stdlib.h libraries, as well as the sprintf and sscanf functions from the stdio.h library are not allowed when solving the task. It is not allowed to create auxiliary strings in function or globally.

What should happen if there is a tie for the most common word? — Karl Knechtel, Jun 26 '22 at 00:21
In your own words, where the code says `while (toupper(*p) >= 'A' && toupper(*p) <= 'Z' && *p != '\0') {`, what is the purpose of the `*p != '\0'` part? Now, consider the implications of that - you have a string that is supposed to be shorter than the original input, that you build in place by moving individual chars around. So - what should be done, in order to indicate the new, modified length of the string? (Since the code does not already do this, do you understand exactly the results you are getting?) — Karl Knechtel, Jun 26 '22 at 00:23
Does this answer your question? [null terminating a string](https://stackoverflow.com/questions/2911089/null-terminating-a-string) — Karl Knechtel, Jun 26 '22 at 00:24
thank you very much, it helped me, but I have some error in the middle also, instead of `ali` I got `al`... — Amar, Jun 26 '22 at 00:52
I have errors other than null terminating string, I edited question a little bit to make more clear what I need, I added another example... — Amar, Jun 26 '22 at 01:08
@Amar I'm sorry but your examples make no sense. In Example 3, you want to remove "pst" from the string, but you don't want to remove it from everywhere, how come? What is the logic here? — Adrian, Jun 26 '22 at 04:40
@Adrian The definition of a *"word"* here seems to be any sequence of alphabetic characters, delimited by non-alphabetic characters or the boundaries of the containing string. Meaning `pst` and `pstpst` are two different words. Note the difference between `stop` and `stopa` in the first example. — Oka, Jun 26 '22 at 05:19
"go word by word through string" is really not a good idea. You want to read the data as little as possible. Here's a simple example that reads the data once to build the data structures, and a 2nd time to do output. If you are every looking at your data multiple times, you should probably refactor: https://github.com/wrp/examples/blob/main/c/remove-common-word.c — William Pursell, Jun 27 '22 at 21:34

Oka · Accepted Answer · 2022-06-27T05:37:51.740

All the major issues are due to the fact that the string is a source of information, while being actively altered.

In general, words are not tokenized properly.

With the input "hello world" as an example, each of hello, ello, llo, lo, and o are considered words when tokenizing the string while looking for the word to remove. The program only advances the string by one character when scanning for words.

The program should advance the string by the length of the current token.

number_of_word_occurrence considers any substring a valid word when making comparisons.

For the input

Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop

the maximum count is incorrectly found to be 5, for stop. This problem compounds with the problem above, and starts removing incorrectly tokenized data that reports this occurrence count.

To generalize, a large problem with this approach is that as you remove a word from the string, the number of occurrences is going to be different for that word, the next time it is found. Looking at

hello hello hello world world

The max occurrence count of a word here is 3, for hello. Looping through to remove the maximum word will see hello the first time, check its occurrence count, find it to be 3, the max, and remove it.

 hello hello world world

For the second hello, checking its occurrence count now will return 2, since the string was altered. This is not the max of 3, so the string is unchanged.

Additionally, the string is not null-terminated as it is altered - only afterwards. Meaning searching for a word might read stale data, giving bad results.

The strict limitation on what functionality the program can utilize (particularity the restrictions on dynamic memory and auxiliary buffers) does make for very exhaustive solutions.

One solution is to first discover the maximum occurrence count of any word in the string, and hold a pointer to this word's first appearance in the string. Then, do count times operations removing the last appearance of the word, which allows you to always have the first appearance as a point of comparison.

A word is removed by overwriting it with everything that follows it in the string (including the null-terminating byte).

Here is a cursory example - largely untested, but provides the correct results for the examples shown. Compile with -DDEBUG to see additional information.

#include <ctype.h>
#include <stdio.h>

typedef struct word {
    const char *base;
    size_t length;
} word;

#define length(s) (span((s), NULL))
size_t span(const char *base, int (*test)(int))
{
    const char *end = base;

    while (test ? test((unsigned char) *end) : *end)
        end++;

    return (size_t) (end - base);
}

int eql(word a, word b)
{
#ifdef DEBUG
    fprintf(stderr, "DEBUG: A{%zu}<<%.*s>> <=> B{%zu}<<%.*s>>\n",
            a.length, (int) a.length, a.base,
            b.length, (int) b.length, b.base);
#endif

    if (!a.length || !b.length || a.length != b.length)
        return 0;

    if (a.base == b.base)
        return 1;

    for (size_t i = 0; i < a.length; i++)
        if (tolower((unsigned char) a.base[i]) != tolower((unsigned char) b.base[i]))
            return 0;

    return 1;
}

word get_word(const char *s, const char **end)
{
    word w = { 0 };

    while (*s && !isalpha((unsigned char) *s))
        s++;

    w.base = s;
    w.length = span(s, isalpha);

    *end = (s + w.length);

    return w;
}


word find_last(const char *s, word mark, unsigned *count)
{
    word last = { 0 };
    unsigned c = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(mark, current)) {
            last = current;
            c++;
        }
    }

    if (count)
        *count = c;

    return last;
}

word find_most_common(const char *s, unsigned *count)
{
    word most_common = { 0 };

    *count = 0;

    for (const char *end; *s; s = end) {
        word current = get_word(s, &end);

        if (eql(most_common, current))
            continue;

        unsigned c;

        (void) find_last(s, current, &c);

        if (c > *count) {
            most_common = current;
            *count = c;
        }
    }

    return most_common;
}

void copy(char *dest, char *source, size_t length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] = source[i];
}

void remove_most_common(char *s)
{
    unsigned count = 0;
    word common = find_most_common(s, &count);

#ifdef DEBUG
    if (count)
        fprintf(stderr, "DEBUG: MOST COMMON WORD: [[%.*s]]x%u\n",
                (int) common.length, common.base, count);
#endif

    size_t len = length(s);

    while (count--) {
        word last = find_last(s, common, NULL);

        copy(
            (char *) last.base,
            (char *) last.base + last.length,
            len - (size_t) (last.base - s) + 1);

        len -= last.length;
    }
}

int main(void)
{
    char buffer[4096];

    if (!fgets(buffer, sizeof buffer, stdin))
        return 1;

    size_t len = length(buffer);

    if (len && buffer[len - 1] == '\n')
        buffer[len - 1] = 0;

    printf("<<%s>>\n", buffer);
    remove_most_common(buffer);
    printf("<<%s>>\n", buffer);
}

thank you very much, but could you edit this according to this professor's instruction: `Create a helper function that counts how many times a word is repeated in a string. Then, for each word, call that function, and if the number of repetitions is n> max, max = n and save the pointers to the beginning and end of that max word. Finally, remove everything between those two pointers.` — Amar, Jun 27 '22 at 14:12
@Amar The code example I've posted here is a proof of the method I described, and answers the original question. It is for you to read and understand. It is not an attempt to solve your homework for you, and this website is not a *do-my-homework* programming service. You must do your own homework. Understanding my example may help you achieve this. If you get stuck while attempting to solve this new problem on your own, you can again [ask](https://stackoverflow.com/questions/ask) a *separate* question, detailing this new problem. Questions must each be about a *specific* problem or topic. — Oka, Jun 27 '22 at 14:22

score 1 · Answer 2 · answered Jun 27 '22 at 03:14

I decided to write a new function that removes all occurrences of a word from a string, which exactly behaves how you want it to. You only need to provide the source and the word that needs to be removed.

Code:

#include <stdio.h>
#include <ctype.h> // toupper
#include <string.h> // strlen
#include <stdbool.h> // bool

void removeWord(char* source, char* removeThis)
{
    int i, j;
    bool wordFound;
    int sourceLength, removeLength;

    sourceLength = strlen(source);
    removeLength = strlen(removeThis);

    for(i = 0; i < sourceLength; ++i)
    {
        wordFound = true;

        for(j = 0; j < removeLength; ++j)
        {
            if(toupper(source[i + j]) != toupper(removeThis[j]))
            {
                wordFound = false;
                break;
            }
        }

        // It is not a word if the previous character or after the last one is alphabetic
        if(wordFound && (isalpha(source[i + j]) || (i > 0 && isalpha(source[i - 1]))))
        {
            wordFound = false;
        }

        if(wordFound)
        {
            for(j = i; j <= sourceLength - removeLength; ++j)
            {
                source[j] = source[j + removeLength];
            }

            --i;
            sourceLength -= removeLength;
        }
    }
}

int main()
{
    char string1[] = "Koristio sam auto-stop da dodjem do znaka stop ali prije stopa sam otvorio dekstop kompjutera stop";
    removeWord(string1, "stop");
    printf("\n%s", string1);

    char string2[] = {"This is string."};
    removeWord(string2, "this");
    printf("\n%s", string2);

    char string3[] = "1PsT1 psT2 3Pst pstpst pst";
    removeWord(string3, "pst");
    printf("\n%s", string3);

    char string4[] = "oneeebigggwwooorrrddd";
    removeWord(string4, "oneeebigggwwooorrrddd");
    printf("\n%s", string4);

    char string5[] = "Tomtom";
    removeWord(string5, "tom");
    printf("\n%s", string5);

    return 0;
}

Output:

Koristio sam auto- da dodjem do znaka  ali prije stopa sam otvorio dekstop kompjutera
 is string.
11 2 3 pstpst

Tomtom

Based on this you should be able to write the part to find the most common word, store it, and feed it to this function.

While interesting, the restrictions imposed in the question means this has to work when `removeThis` and `source` overlap (try `removeWord(string1, strstr(string1, "stop"))` or `removeWord(string3, string3 + strlen(string3) - 3)` for examples of problems with combining these approaches). — Oka, Jun 27 '22 at 03:35
In the first case, the output is `Koristio sam auto-`. I assume that this is correct, but in the second case it removes only the first occurrence of `pst`. — Adrian, Jun 27 '22 at 04:10
The *Output* shown in your answer does contain the expected results. The problem being that when `removeThis` and `source` overlap, `strlen(removeThis)` will give the length from the start of `removeThis` to the end of `source`, and changing `source` will cause `removeThis` to change as well. In `removeWord(string1, strstr(string1, "stop"))`, `removeThis` is everything after the `'-'`. In `removeWord(string3, string3 + strlen(string3) - 3)`, `removeThis[0]` becomes the null-terminating byte after the first mutation of `source`. — Oka, Jun 27 '22 at 12:11
thank you very much, but I cannot store found word because auxiliary strings are not allowed, could you edit this according to this professor's instruction: `Create a helper function that counts how many times a word is repeated in a string. Then, for each word, call that function, and if the number of repetitions is n> max, max = n and save the pointers to the beginning and end of that max word. Finally, remove everything between those two pointers.` I already created some functions, but I don't know how to delete it correctly... — Amar, Jun 27 '22 at 14:11

Remove most common word from string in C

2 Answers2