Function that finds whether 2 string are made from same words

Question

I need to create a function in C, which finds out if 2 strings are made from same words. As can be seen in current code, I loaded each string in separate array. I made it that in the array there are words, all in lower case letters with just 1 space between each word and without all non-alpha characters. I though, that I could just sort the string and call strcmp on them, but it can't be done so, because of the reason, that there can be strings such as "dog dog dog cat" and "dog cat" , these strings are from same words, so the function should return 1, but it wouldnt if just sorted and used strcmp. So i though, I could merge all duplicated words in 1 and then sort and strcmp, but there is still one problem, that when there would be words such as "dog" and "god" , these are 2 different words, but the function would still take them as same after sorting. "dog dog dog cat" "dog cat" - same words "HI HeLLO!!'" "hi,,,hello hi" - same words I would be very thankful for any help. I really don't know how to create it, I sat at it for quite some time and still can't figure it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int sameWords( const char * a, const char * b)
{
char * array1=NULL;
char * array2=NULL;
int length1=0, length2=0, i=0, j=0;
while(a[i])
  {
  if(i>=length1)
    {
      length1+=250;
      array1=(char*)malloc(length1*sizeof(char));
    }
  if(isspace(a[i]) && !isspace(a[i-1]))
    {
      array1[i]=a[i];
    }
  if(isalpha(a[i]))
    {
      array1[i]=tolower(a[i]);
    }
  i++;
  }
while(b[j])
  {
  if(j>=length2)
    {
      length2+=250;
      array2=(char*)malloc(length2*sizeof(char));
    }
  if(isspace(b[j]) && !isspace(b[j-1]))
    {
      array2[j]=b[j];
    }
  if(isalpha(b[j]))
    {
      array2[j]=tolower(b[j]);
    }
  j++;
  }
}

int main()
{
sameWords("This' is   string !!! ", "THIS stRing is !!  string ");
return 0;
}

You should just try taking each real word from string 1, one at a time and see if string 2 contains it. If all words from string 1 were in string 2, do the same thing for string 2 words contained in string 1. No sorting, just check if each valid word is contained in the other string. If all valid words from string 1 are in string 2 and vice versa, then you've got a match. — Josh, Dec 02 '13 at 15:44
"*'dog' and 'god' , these are 2 different words, but the function would still take them as same after sorting.*" Why? — alk, Dec 02 '13 at 15:50
Yes, thanks. This was what I was also thinking, but I somehow don´t know how to implement it. How to take one word after another and check if they are in the other string. — beranpa8, Dec 02 '13 at 15:50
Why not implement what you propose for each of the to strings: merging all occurencies of a word into one and sortt them, then compare the two arrays. — alk, Dec 02 '13 at 15:54
Because that would be the thing with the dog and god. Dog and god are different words and I think that it would sort letters, so it would be "dgo" and it would be the same "dgo" in both arrays, so it would compare them as same. — beranpa8, Dec 02 '13 at 15:58
Basic question: Does your defintion in the context of this question allows a "string" to contain white-spaces? — alk, Dec 02 '13 at 16:34

score 0 · Answer 1 · answered Dec 02 '13 at 15:44

0

You are returning nothing from your function sameWords whose return type is int.

answered Dec 02 '13 at 15:44

haccks

104,019
25
176
264

It should be 250, because right before the allocation there is { length1+=250;} – beranpa8 Dec 02 '13 at 15:46
I know I am returning nothing right now, but it should return 1 if those string are from same words and 0 otherwise. Right now I just need to find out how to find out whether there are same words or not. – beranpa8 Dec 02 '13 at 15:49
Ok, I´m sorry, but I´m new to programming. But the returning isn´t the problem right now. – beranpa8 Dec 02 '13 at 15:59

score 0 · Answer 2 · edited May 23 '17 at 11:56

0

I don't pretend to be awarded as the answer, but I would take a look at regular expressions too for this kind of things.

Does C or C++ have a standard regex library?

It would take minutes to solve it, you split the string with regex, lowercase-it, and then iterate to look after common words.

edited May 23 '17 at 11:56

Community

1
1

answered Dec 02 '13 at 15:44

Giupo

413
2
9

Thank you, but I can´t probably use reg.ex. because it is something like homework to school and we yet didn´t learn regular expressions, so I can´t use them. – beranpa8 Dec 02 '13 at 15:55
Can you add the constraints of your homework? can you use functions like `strtok`, `strstr` from `` ? – Giupo Dec 02 '13 at 16:07

score 0 · Answer 3 · answered Dec 02 '13 at 16:03

0

What I would do to solve this problem is create a data structure like a tree into which you can insert words. The insert function would do nothing if the word is already there, otherwise, it would convert it to lowercase and insert it in the tree. Then you could simply convert both strings to these types of trees and compare the trees.

Another way to do this is in bash. While this is probably not allowed for you assignment, if you understand how and why it works, you should be able to code something up that mimics it:

# string1 and string2 are simply strings with spaces separating words
s1="dog dog dog cat"
s2="cat dog"

# Convert to arrays
a1=( $(printf "%s\n" ${s1}  | sort | uniq ) )
a2=( $(printf "%s\n" ${s2}  | sort | uniq ) )

# Compare the result
if [ "${a1[*]}" == "${a2[*]}" ] ; then
  echo "Same"
fi

answered Dec 02 '13 at 16:03

Trenin

2,041
1
14
20

I though I would take separate words from string1 and search whether they are in string 2, one after another. And then take words from 2 and search for them in 1. But I don´t know how to separate the words from the string. That is the main problem. – beranpa8 Dec 02 '13 at 16:18
`man strtok` :) or Google "`strtok`". The **magic** term is is *tokenize* – Giupo Dec 02 '13 at 16:34
Or you could use scanf for formatted input. You can even use it on strings. fscanf to read from a file (or STDIN), sscanf to read from a string. – Trenin Dec 03 '13 at 12:47
@haccks As I said in the answer, this probably isn't applicable for OP's homework, but this script will work. Since it is homework, I thought this was a good example that doesn't do all the work for him, but rather shows an example of what works, but still leaves some work to do. – Trenin Dec 03 '13 at 12:48

score 0 · Accepted Answer · answered Dec 02 '13 at 18:41

You have already learned two ways to go about your problem. The complicated one is to split each of the strings into words, sort them and then weed out duplicates, which is easy in a sorted array. The easier one is to split the first string into words, search for each word in the second. Then do the same the other way round: split the second and check for words in the first.

Both approaches require that you split the strings. That's also where you seem to have problems in your code. (You've got the basic idea to look at word boundaries, but you don't seem to know how to store the words.)

The basic question is: How are you going to represent the words, i.e. the substrings of a C string? There are various ways. You could use pointers into the string together with a string length or you could copy them into another buffer.

Here is a sloution that splits the string a into words and then checks whether each word can be found in b:

/*
 *      Return 1 if all words in a can be found in b, 
 *      return 0 otherwise.
 */
int split_and_check(const char *a, const char *b)
{
    int begin = -1;    /* marker for beginning of word */
    char word[80];     /* temporary buffer for current word */
    int prev = 0;      /* previously read char to detect word bounaries */
    int len;           /* current length of word */
    int i;

    i = 0;
    while (1) {
        if (isalpha(a[i])) {
            if (!isalpha(prev)) {
                begin = i;
                len = 0;
            }
            if (len < 80) word[len++] = a[i];
        } else {
            if (len > 0) {
                word[len] = '\0';       /* manually null-terminate word */

                if (strstr(b, word) == NULL) {
                    /* fail on string mismatch */
                    return 0;
                }
                len = 0;                /* reset word-length counter */
            }
        }
        if (a[i] == '\0') break;        /* check end here to catch last word */
        prev = a[i++];
    }

    return 1;
}

The current word is stored in the local char buffer word and has the length len. Note how the zero end marker '\0' is added to word manually before searching b for word: The library function strstr looks for a string in another one. Both strings must be zero-terminated.

This is only one half of the solution. You must check the strings the other way round:

int same_words(const char *a, const char *b)
{    
    if (split_and_check(a, b) == 0) return 0;
    if (split_and_check(b, a) == 0) return 0;

    return 1;
}

This is not yet the exact solution to your problem, because the string matching is done case-sensitively. I've skipped this part, because it was easier that way: strstr is case sensitive and I don't know of any variants that ignore the case.

You can convert both arrays to lower case before to solve the case problem. — Trenin, Dec 03 '13 at 12:49
Thank you very much. I am gonna go through your code and I think I could learn from that how to do it. — beranpa8, Dec 03 '13 at 20:32

Function that finds whether 2 string are made from same words

4 Answers4