I have two vectors of strings. I want to compare each string in vector1 with each string in vector2 and count how many words the two strings have in common. The code I have only detects the case where the two strings are identical:

Compare::Compare(vector<string> text1, vector<string> text2, int ratio)
{
    text1Size_ = text1.size();
    text2Size_ = text2.size();

    if(text1Size_ > text2Size_)
    {
        totalWords_ = text1Size_;
    }
    else
    {
        totalWords_ = text2Size_;
    }

    it = text1.begin();

    for(int i = 0; i < text1Size_; i++)
    {
        it2 = text2.begin();

        for(int j = 0; j < text2Size_; j++)
        {
            if(*it == *it2)
            {
                cout << "Perfect match";
            }
            it2++;
        }
        it++;
    }
}

I need to return each pair of strings whose proportion of matching words is at least the given ratio.

Is there an easier way than parsing each string, putting its words in an array, and comparing them?

-EDIT-

By word I mean a written word like "bird". I'll give an example.

Let's say I only have one string per vector and I need a 60% ratio of similarity:

string1 : The blue bird.
string2 : The bird.

What I want to do is check whether at least 60% of the written words match in both sentences.

Here "The" and "bird" match, so 2 of the 3 words are similar (66.67%), and these strings will be accepted.

-EDIT 2-

I don't think I can use ".compare()" here since it will check each character and not each written word...
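
To illustrate the difference, here is a minimal sketch (assuming C++11; `same_string` and `split_words` are hypothetical helper names, not from the original code). A whole-string comparison treats the two example sentences as different, while splitting on whitespace yields words that can be compared one by one:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Whole-string comparison: true only if every character matches.
bool same_string(const std::string& a, const std::string& b) {
    return a.compare(b) == 0;  // equivalent to a == b
}

// Hypothetical helper: split on whitespace. Punctuation stays attached,
// so "bird." and "bird" would count as different words.
std::vector<std::string> split_words(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}
```

With the example above, `same_string("The blue bird.", "The bird.")` is false, but the first word of each sentence ("The") and the last word of each ("bird.") compare equal once the sentences are split.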

LolCat
  • Your usage of `word` is sort of confusing. Are you trying to match the written word or a computer word, as in 2 bytes (16 bits)? Furthermore, even if it's the written word (i.e. "dog", "cat", "horse"), I see no attempt to compare the actual contents of the string. This means you must be talking about whether one string matches the other; methods for that already exist, so just use those. – Security Hound Oct 11 '12 at 18:01
  • Is there a reason for not using .compare() ? –  Oct 11 '12 at 18:05
  • This seems like a duplicate of http://stackoverflow.com/questions/5492485/strcmp-or-stringcompare?rq=1 ; it sounds like you should do more research on the correct way to compare two strings in C++. – Security Hound Oct 11 '12 at 18:06
  • Please clarify how you want to count: If for example one string is "cat rock dog rock horse rock", and the other is "rock dog rock squirrel stone", how many words are the same in both strings? – anatolyg Oct 11 '12 at 18:32
  • @Ramhound I had already searched, but I didn't find what I needed. I will still look at the link you gave me. – LolCat Oct 11 '12 at 18:33
  • @anatolyg OK: with your example, rock, dog, rock would match. Each word is tested once to see whether it is present in the other string. Then the next word is tested against all the words of the other string... – LolCat Oct 11 '12 at 18:36
  • @LolCat - Just use the standard library method to compare each string in the vectors. Since they must be the same size you only need one loop. Please don't check whether one string equals another by doing `*it == *it2`; that is asking for trouble and is just bad code. – Security Hound Oct 11 '12 at 18:50
  • @Ramhound I see nothing wrong in comparing strings by doing `*it == *it2`; it seems to be a perfectly standard method (using `string::operator==`) – anatolyg Oct 11 '12 at 18:54

1 Answer

Use a string stream to split a string into words:

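#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

bool is_similar(const std::string& str1, const std::string& str2)
{
    std::vector<std::string> words1, words2;
    std::string temp;

    // Convert the first string to a list of words
    std::stringstream stringstream1(str1);
    while (stringstream1 >> temp)
        words1.push_back(temp);

    // Convert the second string to a list of words
    std::stringstream stringstream2(str2);
    while (stringstream2 >> temp)
        words2.push_back(temp);

    int num_of_identical_words = 0;
    // Now, use the code you already have to count identical words
    ...

    // Divide by the larger word count, so that "The blue bird." vs
    // "The bird." gives 2/3 as in the question
    const std::size_t total_words = std::max(words1.size(), words2.size());
    if (total_words == 0)
        return false; // two empty strings: avoid dividing by zero

    double ratio = static_cast<double>(num_of_identical_words) / total_words;
    return ratio >= 0.6; // "at least" 60%
}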
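
For reference, the counting step left open above could be filled in roughly as follows. This is only a sketch under stated assumptions, not a definitive implementation: it assumes C++11, compares words case-sensitively with punctuation attached, and takes the counting rule from the question and its comments (each word of the shorter sentence is tested once against the longer one, and the count is divided by the longer sentence's word count). The names `split_words`, `word_match_ratio` and `find_similar` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a sentence into whitespace-separated words (punctuation stays attached).
std::vector<std::string> split_words(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}

// Fraction of matching words, relative to the longer sentence.
// Assumption: each word of the shorter sentence is tested once for
// presence in the longer sentence.
double word_match_ratio(const std::string& a, const std::string& b) {
    std::vector<std::string> shorter = split_words(a);
    std::vector<std::string> longer = split_words(b);
    if (shorter.size() > longer.size()) {
        shorter.swap(longer);
    }
    if (longer.empty()) {
        return 0.0;  // both sentences are empty: nothing to compare
    }

    std::size_t matches = 0;
    for (const std::string& word : shorter) {
        if (std::find(longer.begin(), longer.end(), word) != longer.end()) {
            ++matches;
        }
    }
    return static_cast<double>(matches) / static_cast<double>(longer.size());
}

// Collect every (text1, text2) pair whose ratio is at least min_ratio.
std::vector<std::pair<std::string, std::string>> find_similar(
    const std::vector<std::string>& text1,
    const std::vector<std::string>& text2,
    double min_ratio) {
    std::vector<std::pair<std::string, std::string>> result;
    for (const std::string& s1 : text1) {
        for (const std::string& s2 : text2) {
            if (word_match_ratio(s1, s2) >= min_ratio) {
                result.emplace_back(s1, s2);
            }
        }
    }
    return result;
}
```

With the strings from the question, `word_match_ratio("The blue bird.", "The bird.")` is 2/3, so the pair passes a 0.6 threshold. The linear `std::find` keeps the sketch short; for long sentences a `std::set` or a sorted vector of words would likely be a better fit.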
anatolyg