Efficient way to search for a set of strings in a string in Java

Question

I have a set of elements of size about 100-200. Let a sample element be X.

Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}

For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.

X is present in P iff for all s belong to X, s is a substring of P.

The set of elements is available for pre-processing.

I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:

Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring

Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.

Please note that the set of elements is available for pre-processing and I want the solution in java.

Do you expect there to be much duplication of elements across the sets, or should they be mostly unique? — codebox, Sep 11 '12 at 09:53
@codebox, There will be some duplication across sets. Maybe 1 string will be common to at most 10 sets or so. Can I exploit this property somehow? — BiGYaN, Sep 11 '12 at 14:41

score 2 · Accepted Answer · answered Sep 11 '12 at 09:50

2

You can use regex directly:

Pattern regex = Pattern.compile(
    "^               # Anchor search to start of string\n" +
    "(?=.*s1)        # Check if string contains s1\n" +
    "(?=.*s2)        # Check if string contains s2\n" +
    "(?=.*s3)        # Check if string contains s3", 
    Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

foundMatch is true if all three substrings are present in the string.

Note that you might need to escape your "needle strings" if they could contain regex metacharacters.

answered Sep 11 '12 at 09:50

Tim Pietzcker

328,213
58
503
561

Thanks for your suggestion. This will be my initial approach. – BiGYaN Sep 12 '12 at 05:46
This approach turns out to be _fast enough_ for my purposes. Thanks. – BiGYaN Sep 25 '12 at 06:08

Dunes · Answer 2 · 2012-09-11T11:00:00.557

It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.

The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out a better time saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in 100 length string when checking for the presence of a 5 length string with non-repeating characters (eg. coast). And since X has 100-200 elements you really, really want to fail fast if you can.

My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.

Thanks for your suggestion. I will try out this fail fast approach with the redundancy factor which @codebox hinted at. — BiGYaN, Sep 12 '12 at 05:43

Zar Shardan · Answer 3 · 2013-04-28T03:33:15.413

1

Looks like a perfect case for the Rabin–Karp algorithm:

Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

edited Apr 28 '13 at 03:33

answered Apr 28 '13 at 03:27

Zar Shardan

5,675
2
39
37

score 0 · Answer 4 · answered Sep 11 '12 at 10:02

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.

The algorithm to index a string would look like that (untested):

HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();

for (int pos = 0; pos < string.length(); pos++) {
    for (int sublen=0; sublen < string.length-pos; sublen++) {
         String substring = string.substr(pos, sublen);
         Set<String> stringsForThisKey = indexes.get(substring);
         if (stringsForThisKey == null) {
             stringsForThisKey = new HashSet<String>();
             indexes.put(substring, stringsForThisKey);
         }
         stringsForThisKey.add(string);
    }
}

Indexing each string that way would be quadratic to the length of the string, but it only needs to be done once for each string.

But the result would be constant-speed access to the list of strings in which a specific string occurs.

I believe the pre-processing the OP is after is on the set of string (dictionary) `X`, and not on the input string. — amit, Sep 11 '12 at 10:22

score 0 · Answer 5 · answered Sep 11 '12 at 10:18

0

You are probably looking for Aho-Corasick algorithm, which constructs an automata (trie-like) from the set of strings (dictionary), and try to match the input string to the dictionary using this automata.

answered Sep 11 '12 at 10:18

amit

175,853
27
231
333

Thanks for your suggestion. I'm aware of Aho-Corasick algo. I was under the impression that regex will internally use something like this internally. BTW, any libraries that you can refer me on this? – BiGYaN Sep 12 '12 at 05:47
@BiGYaN: regex does NOT use this algorithm AFAIK. actually, in java at least - [regex might decay to exponential running time](http://stackoverflow.com/questions/8887724/why-can-regular-expressions-have-an-exponential-running-time) in some cases. I am unfamiliar with libraries that implement the algorithm, but the wiki page refers to [this implementation](https://hkn.eecs.berkeley.edu/~dyoo/java/index.html) under the "external links" section. – amit Sep 12 '12 at 05:54
if I use only regular expressions (recognizable by a Finite Automata) then something like Thompson NFA or Aho-Corasick should be the fastest implementation ... right? I was under the impression that something as widely used as java should be having these algorithms implemented. then I read http://swtch.com/~rsc/regexp/regexp1.html – BiGYaN Sep 12 '12 at 09:39
The thing is these edge cases are very rare. However, I believe using Thompson or Aho-Corasick will most likely be the most efficient approach for your problem. – amit Sep 12 '12 at 10:02
Thanks for the insight. Now I'll need to find some good implementation for either of them. If not, then I will have to code it up. :( – BiGYaN Sep 12 '12 at 11:38
@BiGYaN: The wikipedia page, under external links, links to this: https://hkn.eecs.berkeley.edu/~dyoo/java/index.html – amit Sep 12 '12 at 11:40
The regex based solution proved to be good enough for my purposes. I have my reservations against using uncommon/unknown libraries within production code. – BiGYaN Sep 25 '12 at 06:10

Peter Lawrey · Answer 6 · 2012-09-12T07:01:27.770

One way is to generate every possible substring and add this to a set. This is pretty inefficient.

Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.

static class SubstringMatcher {
    final NavigableSet<String> set = new TreeSet<String>();

    SubstringMatcher(Set<String> strings) {
        for (String string : strings) {
            for (int i = 0; i < string.length(); i++)
                set.add(string.substring(i));
        }
        // remove duplicates.
        String last = "";
        for (String string : set.toArray(new String[set.size()])) {
            if (string.startsWith(last))
                set.remove(last);
            last = string;
        }
    }

    public boolean findIn(String s) {
        String s1 = set.ceiling(s);
        return s1 != null && s1.startsWith(s);
    }
}

public static void main(String... args) {
    Set<String> strings = new HashSet<String>();
    strings.add("hello");
    strings.add("there");
    strings.add("old");
    strings.add("world");
    SubstringMatcher sm = new SubstringMatcher(strings);
    System.out.println(sm.set);
    for (String s : "ell,he,ow,lol".split(","))
        System.out.println(s + ": " + sm.findIn(s));
}

prints

[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false

Thanks for your suggestion. I haven't finalized on the approach. If not anything else, I learnt about `NavigableSet`. — BiGYaN, Sep 12 '12 at 05:45

score 0 · Answer 7 · answered Sep 11 '12 at 11:22

0

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here

I have used proprietary implementations (that I no longer even have access to) and they are very fast.

answered Sep 11 '12 at 11:22

JoeG

7,191
10
60
105

If you are referring to Ukonnen's algorithm, I am not aware of any standard implementations of it. Its a fairly complicated algo, and I would not like to code it up OR use some not-so-well-known source code. – BiGYaN Sep 12 '12 at 06:10

Efficient way to search for a set of strings in a string in Java

7 Answers7