Better solution of finding a pattern of a string?

Question

I am trying to find an optimal way to find a pattern of a string and compare. For example, I have s1 = "red blue blue red red yellow", and s2 = "abbaac". This would match because they have the same pattern.

My thinking of doing this would be iterate through s1 and s2, use a vector container to record the corresponding place's count (for s1 would be corresponding word's count, and for s2 would be corresponding letter's count) and then compare.

This is really inefficient because I iterator through the whole s1 and s2. If s1 = "red blue red red red yellow" and s2 = "abbaac". After the third red, there is essentially no point to keep iterating it through.

So, any better idea on how to do this?

Code:

#include "stdafx.h"
#include <iostream>
#include <string>
#include <array>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;

vector<int> findPattern(string pattern){
    vector<int> counts;
    for (int i = 0; i < pattern.size(); ++i){
        counts.push_back(0);
        int counter = 0;
        for (int j = i + 1; j < pattern.size(); ++j){
            if (pattern[i] == pattern[j]){
                ++counter;              
            }   
            counts[i] = counter;    
        }
    }
    return counts;
}

vector<int> findPatternLong(string pattern){
    istringstream iss (pattern);
    string word;
    vector<string> v;
    while (iss >> word){
        v.push_back(word);
    }
    vector<int> counts2;
    for (int i = 0; i < v.size(); ++i){
        counts2.push_back(0);
        int counter = 0;
        for (int j = i + 1; j < v.size(); ++j){
            if (v[i] == v[j]){
                ++counter;
            }
            counts2[i] = counter;
        }
    }
    return counts2;
}

int main(int argc, char * argv[]){
    vector<int> v1 = findPattern("abbaac");
    vector<int> v2 = findPatternLong("red blue blue red red yellow");
    if (v1.size() == v2.size()){
        for (int i = 0; i < v1.size(); ++i){
            if (v1[i] != v2[i]){
                cout << "Unmatch" << endl;
                return false;
            }
        }
        cout << "match" << endl;
        return true;
    } else 
        cout << "Unmatch" << endl; 
    return 0;
}

score 2 · Answer 1 · answered Mar 06 '13 at 02:59

@Tony beat me with same idea, but since I already typed this, here it goes :-)

First of all, don't worry so much about efficiency and focus on correctness: indeed, premature optimization is the root of all evil. Write test cases and make sure your code passes each one.

Second, I think I would start with a maps/dictionary D, and have a loop in which I'd parse one element of each string (a word in s1, let's call it "w" and a character in your s2, say "c"), choose one element as the key (say the "c" characters) and check if "c" already has an entry in the dictionary:

If we ran out of elements at the same time, the strings match
If we ran out of elements on one side, we know there's no match
If "c" doesn't have an entry in D, store the current values: D[c] = w;
else if "c" already has an entry, check if the entry matches the value found on the string: is D[c] == w? If it doesn't we know there's no match

If that code works, then optimization could start. In your example, maybe we could use a simple array instead of a dictionary because ASCII characters are a small finite set.

Tony Delroy · Accepted Answer · 2013-03-07T02:45:20.187

1

It's not the most efficient code, but close to simplest:

std::map<char, std::string> letter_to_word;
std::set<std::string> words_seen;
std::istringstream iss(s1);
std::string word;
for (std::string::size_t i = 0; i < s2.size(); ++i)
{
    if (!(iss >> word))
        return false; // more letters than words
    std::string& expected_word = letter_to_word[s2[i]];
    if (expected_word == "")
    {
        // if different letters require different words...
        if (words_seen.find(word) != words_seen.end())
            return false; // multiple letters for same word
        words_seen.insert(word);

        expected_word = word; // first time we've seen letter, remember associated word
    }
    else if (expected_word != word)
        return false; // different word for same letter
}
return !(iss >> word); // check no surplus words

edited Mar 07 '13 at 02:45

answered Mar 06 '13 at 02:52

Tony Delroy

102,968
15
177
252

Why oh why does stackoverflow have such trouble preserving code indentation? :-/ – Tony Delroy Mar 06 '13 at 02:53
Can you explain a little more about these two lines? 1, std::string& mapping = words[s2[i]]; And why do you need to make it a reference? 2, return !(iss >> word); since, you already check it once, so this line makes sure there are no more words than letters? – HoKy22 Mar 06 '13 at 22:46
@HoKy22: thanks for pointing that aspect out; strictly it's unspecified in the question whether the letter pattern is meant to enforce differences as well as sameness (e.g. whether "ab" can match any two words, or just two different words), but I agree it's more intuitive on balance to enforce differences. I'll add edits above.... – Tony Delroy Mar 07 '13 at 02:41

score 0 · Answer 3 · answered Mar 06 '13 at 02:41

You don't need two vectors.

When processing the second string, compare the count of the first pattern, to the first entry. If it matches, keep going otherwise stop. Repeat for the rest of the patterns in the second string.

You don't need to store the pattern counts of the second string.

score 0 · Answer 4 · edited May 23 '17 at 12:21

Basically, you want to check that the sequence follows the same order. You're not worried about what the sequence actually is: first second first first third is good enough. Now, you could do this with a container that maps a string to an int in some way. However, you would be storing copies of each string and you're ignoring the fact that you don't really care about string values. For tiny test cases, this wouldn't matter, but for a large sequence of long words, you're quickly chewing up memory when you don't need to.

So let's use the fact that we don't care about the string values or about storing them. If that's the case, we can use a hash function to transform our strings to simple size_t values with a fairly strong guarantee that they're going to be unique. However, the hashes are not sequential and we will need to retrieve the sequence based on the hash value. The simplest way to record their sequence is to map them to the size of the map for easy lookup. The last piece of the puzzle is to check that the hashes are in the same sequence.

I'm also assuming that you don't just want to compare a sentence with a word, but maybe 2 words or two sentences. Here's a quick C++11 sample that basically does the above and doesn't hold anything in memory unless it needs to.

Ofcourse, this can still be optimized more - for example, executing things parallel.

#include <iostream>
#include <vector>
#include <string>
#include <map>
#include <sstream>
/*
s1 = "red blue blue red red yellow"
s2 = "abbaac"
This would match because they have the same pattern.
*/
typedef std::map<size_t,size_t> hash_map;
typedef std::vector<std::string> wordlist;

size_t ordered_symbol( hash_map &h, std::string const& word )
{
    std::hash<std::string> hash_fn;
    size_t hash = hash_fn(word);
    if(h.find(hash)==h.end())
    {
        size_t const sequence = h.size();
        h[hash] = sequence;
        return sequence;
    }
    return h[hash];
}

wordlist create_wordlist( std::string const& str )
{
    if(str.find_first_of(' ') != std::string::npos)
    {
        wordlist w1;
        std::stringstream sstr(str);
        std::string s;
        while(sstr>>s)
            w1.push_back(s);
        return w1;        
    }
    wordlist w2;
    for(auto i : str)
    {
        std::string s;
        s.append(1,i);
        w2.push_back(s);
    }
    return w2;
}

bool pattern_matches( std::string const& s1, std::string const& s2 )
{
    wordlist const w1 = create_wordlist(s1);
    wordlist const w2 = create_wordlist(s2);
    if(w1.size()!=w2.size())
        return false;
    hash_map h1,h2;
    for( size_t i = 0; i!=w1.size(); ++i)
        if(ordered_symbol(h1,w1[i])!=ordered_symbol(h2,w2[i]))
            return false;
    return true;
}

void test( std::string const& s1, std::string const& s2 )
{
    std::cout<<"["<<s1<<"] "
             <<(pattern_matches(s1,s2)? "<==>" : "<=!=>")
             <<"["<<s2<<"]\n";    
}

int main()
{
    test("red blue blue red red yellow","abbaac");
    test("red blue blue red red yellow","first second second first first third");
    test("abbaac","12211g");
    test("abbaac","red blue blue red red yellow");
    test("abbgac","red blue blue red red yellow");
    return 0;
}

//Output:
//[red blue blue red red yellow] <==>[abbaac]
//[red blue blue red red yellow] <==>[first second second first first third]
//[abbaac] <==>[12211g]
//[abbaac] <==>[red blue blue red red yellow]
//[abbgac] <=!=>[red blue blue red red yellow]

EDIT: Here's a non C++11 version that should work on VS2010. However, since C++03 does not include a string hash function in the standard library, this example uses a hash function taken from stack overflow. A much better hash function to use would be this one if you have access to the boost libraries.

I just compiled your program, looks like there is a fatal error on this line on Visual Studio 2010: for(auto i : str) — HoKy22, Mar 06 '13 at 22:02
Works for me on Visual Studio 2012 and ideone: http://ideone.com/T2D4xT. Didn't see you were using VS 2010. VS 2010 does not support C++11, so yes, it will not work there. It will however work on VS 2012 and the latest version of the Clang and GCC compilers. — Carl, Mar 06 '13 at 22:59
"fairly strong guarantee that they're going to be unique" - ummm, no. Have you seen how weak Microsoft's `std::hash` is? For example, `hash("abcdefghijklmnopqrstuvwxyz") == hash("aXcdefghijklmnopqrstuvwxyz") == hash("abXdefghijklmnopqrstuvwxyz")` because max 10 evenly-spaced characters are incorporated. Writing programs that don't work reliably is just ugly. Even a cryptographic 32 bits gets collisions given many strings: chance is linear with #words already hashed + your reason for hashing was "for a large sequence". To save memory, just reference input strings via ptr,len. — Tony Delroy, Mar 08 '13 at 02:53
Well, like I mentioned, you could use boost::hash or just about any other function that maps a string to an integer. Heck, you could use a map from a string to the size of the map - this will guarantee uniqueness too, but that's not the point. Forest for the trees. — Carl, Mar 08 '13 at 04:39

Moo-Juice · Answer 5 · 2013-03-06T21:14:52.047

EDIT

I just read that the question had the patterns in a string and this answer pertains to comparing collections of varying types. I suppose the answer still holds a little water if the 2 input strings were first converted :)

I would not say this is the most efficient solution, but I like how it is extensible.

Firstly, there is the PatternResult class. It stores the result of a pattern:

class PatternResult {
private:
    std::vector<int> result_;

public:
    PatternResult(const std::vector<int>& result) : result_(result) {
    };

    bool operator == (const PatternResult& rhs) const {
        if(result_.size() != rhs.result_.size()) 
            return false;
        else {
            for(std::vector<int>::size_type r(0);
                r < result_.size();
                ++r) {
                if(result_[r] != rhs.result_[r])
                    return false;
            };
            return true;
        };
    };
};  // eo class PatternResult

It takes a vector of integers, the value of which denotes it's value. We overload == to compare two pattern results, meaning they have the same sequence irrespective of the source data.

Then we need a pattern counter that can assign the same sequence numbers, but take any type:

template<class T>
class PatternCounter {
private:
    typedef std::vector<T> vec_type;
    typedef std::map<T, int> map_type;
    map_type found_;
    int counter_;
public:
    PatternCounter() : counter_(1) {
    };

    PatternResult count(const vec_type& input ){
        std::vector<int> ret;
        for(vec_type::const_iterator cit(input.begin());
            cit != input.end();
            ++cit) {
            if(found_.find(*cit) != found_.end()) {
                ret.push_back(found_[*cit]);
            } else {
                found_[*cit] = counter_;
                ret.push_back(counter_);
                ++counter_;
            };
        };
        return PatternResult(ret);
    };
};

And we're done. Test code:

std::vector<std::string> inp1;
inp1.push_back("red");
inp1.push_back("blue");
inp1.push_back("blue");
inp1.push_back("red");
inp1.push_back("yellow");

std::vector<char> inp2;
inp2.push_back('a');
inp2.push_back('b');
inp2.push_back('b');
inp2.push_back('a');
inp2.push_back('c');

PatternCounter<std::string> counter1;
PatternCounter<char> counter2;

PatternResult res1(counter1.count(inp1));
PatternResult res2(counter2.count(inp2));

if(res1 == res2) {
        // pattern sequences are equal
};

Note this was quick and dirty, I am sure it could be made more efficient.

The idea of using OOP to do was very cool, but it does look pretty long — HoKy22, Mar 06 '13 at 22:39

Better solution of finding a pattern of a string?

5 Answers5