Which data structure and algorithm is appropriate for this?

Question

I have 1000's of string. Given a pattern that need to be searched in all the string, and return all the string which contains that pattern.

Presently i am using vector for to store the original strings. searching for a pattern and if matches add it into new vector and finally return the vector.

int main() {
    vector <string> v;
    v.push_back ("maggi");
    v.push_back ("Active Baby Pants Large 9-14 Kg ");
    v.push_back ("Premium Kachi Ghani Pure Mustard Oil ");
    v.push_back ("maggi soup");
    v.push_back ("maggi sauce");
    v.push_back ("Superlite Advanced Jar");
    v.push_back ("Superlite Advanced");
    v.push_back ("Goldlite Advanced"); 
    v.push_back ("Active Losorb Oil Jar"); 

    vector <string> result;

    string str = "Advanced";

    for (unsigned i=0; i<v.size(); ++i)
    {
        size_t found = v[i].find(str);
        if (found!=string::npos)
            result.push_back(v[i]);
    }

    for (unsigned j=0; j<result.size(); ++j)
    {
        cout << result[j] << endl;
    }
    // your code goes here
    return 0;

}

Is there any optimum way to achieve the same with lesser complexity and higher performance ??

You can just save the index of matching string, instead of the string itself. So the `result` becomes `vector` storing only index. — Mine, Dec 25 '13 at 06:10
You can mess around with suffix trees and you'll get lower asymptotic complexity---linear time to build the thing and time proportional to the length of the query string plus the number of outputs to query. There's a rather high constant factor involved, though; at your scale, it's just not worth it. — tmyklebu, Dec 25 '13 at 06:18

score 2 · Answer 1 · edited May 23 '17 at 11:43

2

The containers I think are appropriate for your application.

However instead of std::string::find, if you implement your own KMP algorithm, then you can guarantee the time complexity to be linear in terms of the length of string + search string.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

As such the complexity of std::string::find is unspecified.
http://www.cplusplus.com/reference/string/string/find/

EDIT: As pointed out by this link, if the length of your strings is not large (more than 1000), then probably using std::string::find would be good enough since here tabulation etc is not needed.
C++ string::find complexity

edited May 23 '17 at 11:43

Community

1
1

answered Dec 25 '13 at 06:11

Abhishek Bansal

12,589
4
31
46

Just use `strstr`. High-quality implementations have a `strstr` that runs in linear time and constant space. – tmyklebu Dec 25 '13 at 06:16
1

`strstr` is not a good choice for `std::string`, which may have embedded nulls and in any case has a known length. And there's no reason to use it, given that `std::find` does the same thing without bringing in obsolete C library functions. – Sneftel Dec 25 '13 at 06:18
@Sneftel: None of the strings OP's input have embedded nulls. Blowing up your complexity to quadratic because of embedded nulls isn't a great idea if you're looking for performance, which OP allegedly is. – tmyklebu Dec 25 '13 at 06:19
@tmyklebu why not KMP algorithm? – Abhishek Bansal Dec 25 '13 at 06:22
@AbhishekBansal: `strstr` is already there and it will probably be an order of magnitude faster than a hand-coded `strstr`. KMP's auxiliary array causes quite a severe performance degradation---for short strings, because you have to build it, and for long string because of the additional cache pressure. This is one of those things that you can just try and see for yourself, by the way. – tmyklebu Dec 25 '13 at 06:37
Why would you think that `std::string::find` has greater asymptotic complexity than `strstr`? Other than supporting embedded nulls and potentially benefiting from known lengths, the two do virtually the same thing. – Sneftel Dec 25 '13 at 07:59

tkhomas · Answer 2 · 2013-12-25T07:39:15.603

0

If the result is used in the same block of code as the input string vector (it is so in your example) or even if you have a guarantee that everyone uses the result only while input exists, you don't need actually to copy strings. It could be an expensive operation, which considerably slows total algorithm.

Instead you could have a vector of pointers as the result:

vector <string*> result;

edited Dec 25 '13 at 07:39

answered Dec 25 '13 at 07:22

tkhomas

103
1
8

Are you sure you can have a vector of references? – 6502 Dec 25 '13 at 07:25

score 0 · Answer 3 · answered Dec 25 '13 at 07:59

If the list of strings is "fixed" for many searches then you can do some simple preprocessing to speed up things quite considerably by using an inverted index.

Build a map of all chars present in the strings, in other words for each possible char store a list of all strings containing that char:

std::map< char, std::vector<int> > index;
std::vector<std::string> strings;

void add_string(const std::string& s) {
    int new_pos = strings.size();
    strings.push_back(s);
    for (int i=0,n=s.size(); i<n; i++) {
        index[s[i]].push_back(new_pos);
    }
}

Then when asked to search for a substring you first check for all chars in the inverted index and iterate only on the list in the index with the smallest number of entries:

std::vector<std::string *> matching(const std::string& text) {
    std::vector<int> *best_ix = NULL;
    for (int i=0,n=text.size(); i<n; i++) {
        std::vector<int> *ix = &index[text[i]];
        if (best_ix == NULL || best_ix->size() > ix->size()) {
            best_ix = ix;
        }
    }

    std::vector<std::string *> result;
    if (best_ix) {
        for (int i=0,n=best_ix->size(); i<n; i++) {
            std::string& cand = strings[(*best_ix)[i]];
            if (cand.find(text) != std::string::npos) {
                result.push_back(&cand);
            }
        }
    } else {
        // Empty text as input, just return the whole list
        for (int i=0,n=strings.size(); i<n; i++) {
            result.push_back(&strings[i]);
        }
    }
    return result;
}

Many improvements are possible:

use a bigger index (e.g. using pairs of consecutive chars)
avoid considering very common chars (stop lists)
use hashes computed from triplets or longer sequences
search the intersection instead of searching the shorter list. Given the elements are added in order the vectors are anyway already sorted and intersection could be computed efficently even using vectors (see std::set_intersection).

All of them may make sense or not depending on the parameters of the problem (how many strings, how long, how long is the text being searched ...).

score 0 · Answer 4 · answered Dec 25 '13 at 10:03

If the source text is large and static (e.g. crawled webpages), then you can save search time by pre-building a suffix tree or a trie data structure. The search pattern can than traverse the tree to find matches.

If the source text is small and changes frequently, then your original approach is appropriate. The STL functions are generally very well optimized and have stood the test of time.

Which data structure and algorithm is appropriate for this?

4 Answers4