Better way to detect if a string contains multiple words

Question

I am trying to create a program that detects if multiple words are in a string as fast as possible, and if so, executes a behavior. Preferably, I would like it to detect the order of these words too but only if this can be done fast. So far, this is what I have done:

if (input.contains("adsf") && input.contains("qwer")) {
    execute();          
}

As you can see, doing this for multiple words would become tiresome. Is this the only way or is there a better way of detecting multiple substrings? And is there any way of detecting order?

score 40 · Answer 1 · edited May 23 '17 at 12:10

40

I'd create a regular expression from the words:

Pattern pattern = Pattern.compile("(?=.*adsf)(?=.*qwer)");
if (pattern.matcher(input).find()) {
    execute();
}

For more details, see this answer: https://stackoverflow.com/a/470602/660143

edited May 23 '17 at 12:10

Community

1
1

answered Sep 19 '13 at 01:51

Christoph Walesch

2,337
2
32
44

2

this only handles the first word matched and returns true from what I can tell, does not match if ALL words are in the input string. – Jim Ford Apr 18 '17 at 16:03
1

I liked the use of regex here but this was very slow for me when going through a large amount of text. I found @Jack's answer to be much faster (in my use case). – uesports135 Mar 23 '18 at 13:07
1

Agree, regexps are quite expensive by themselves and you shouldn't use them if you need the highest speed. – Gaket Jun 24 '18 at 09:48
If the goal is maintainability (from the question "*doing this for multiple words would become tiresome*") then this solution does not address it. You use 6 extra characters per word. – Michael Jul 06 '20 at 11:45

score 18 · Accepted Answer · edited Jul 06 '20 at 11:36

18

Editors note: Despite being heavily upvoted and accepted, this does not function the same as the code in the question. execute is called on the first match, like a logical OR.

You could use an array:

String[] matches = new String[] {"adsf", "qwer"};

bool found = false;
for (String s : matches)
{
  if (input.contains(s))
  {
    execute();
    break;
  }
}

This is efficient as the one posted by you but more maintainable. Looking for a more efficient solution sounds like a micro optimization that should be ignored until proven to be effectively a bottleneck of your code, in any case with a huge string set the solution could be a trie.

edited Jul 06 '20 at 11:36

Michael

41,989
11
82
128

answered Sep 19 '13 at 01:49

Jack

131,802
30
241
343

Hmm, I believe that this should work well for my little project. Thanks for such the fast reply! – Silver Sep 19 '13 at 02:01
8

Does this actually work the same as code in question? This one should work more like an or operator. – aycanadal Dec 17 '14 at 00:43
1

for large word sets to match one option is the aho-corasick algorithm try this lib -> https://github.com/robert-bor/aho-corasick – Jim Ford Apr 21 '17 at 14:30
Quick performance improvement is to replace the for-each with a for-i loop. For-each in Java creates an iterator object. Object creation is expensive. If you are optimizing for code to execute within 200ms, the optimization is not worth it. However, if you are optimizing for more critical performance, avoiding object creation makes a huge difference. – Thomas Fischer Mar 01 '19 at 14:16
@ThomasFischer: if code is time critical then the problem is in the algorithm per se, a faster approach (like a Bayer-Moore optimized for multiple strings) will be a proper solution. – Jack Mar 01 '19 at 15:01

score 13 · Answer 3 · edited Jul 06 '20 at 11:41

13

In Java 8 you could do

public static boolean containsWords(String input, String[] words) {
    return Arrays.stream(words).allMatch(input::contains);
}

Sample usage:

String input = "hello, world!";
String[] words = {"hello", "world"};
if (containsWords(input, words)) System.out.println("Match");

edited Jul 06 '20 at 11:41

Michael

41,989
11
82
128

answered Jun 24 '18 at 08:27

Linus

880
13
25

score 3 · Answer 4 · answered Mar 01 '19 at 14:14

This is a classical interview and CS problem.

Robin Karp algorithm is usually what people first talk about in interviews. The basic idea is that as you go through the string, you add the current character to the hash. If the hash matches the hash of one of your match strings, you know that you might have a match. This avoids having to scan back and forth into your match strings. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

Other typical topics for that interview question are to consider a trie structure to speed up the lookup. If you have a large set of match strings, you have to always check a large set of match strings. A trie structure is more efficient to do that check. https://en.wikipedia.org/wiki/Trie

Additional algorithms are: - Aho–Corasick https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm - Commentz-Walter https://en.wikipedia.org/wiki/Commentz-Walter_algorithm

score 1 · Answer 5 · answered Sep 19 '13 at 01:53

If you have a lot of substrings to look up, then a regular expression probably isn't going to be much help, so you're better off putting the substrings in a list, then iterating over them and calling input.indexOf(substring) on each one. This returns an int index of where the substring was found. If you throw each result (except -1, which means that the substring wasn't found) into a TreeMap (where index is the key and the substring is the value), then you can retrieve them in order by calling keys() on the map.

Map<Integer, String> substringIndices = new TreeMap<Integer, String>();
List<String> substrings = new ArrayList<String>();
substrings.add("asdf");
// etc.

for (String substring : substrings) {
  int index = input.indexOf(substring);

  if (index != -1) {
    substringIndices.put(index, substring);
  }
}

for (Integer index : substringIndices.keys()) {
  System.out.println(substringIndices.get(index));
}

score 1 · Answer 6 · answered Jul 24 '16 at 06:03

Use a tree structure to hold the substrings per codepoint. This eliminates the need to

Note that this is efficient only if the needle set is almost constant. It is not inefficient if there are individual additions or removals of substrings though, but a different initialization each time to arrange a lot of strings into a tree structure would definitely slower it.

`StringSearcher`:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

class StringSearcher{
    private NeedleTree needles = new NeedleTree(-1);
    private boolean caseSensitive;
    private List<Integer> lengths = new ArrayList<>();
    private int maxLength;

    public StringSearcher(List<String> inputs, boolean caseSensitive){
        this.caseSensitive = caseSensitive;
        for(String input : inputs){
            if(!lengths.contains(input.length())){
                lengths.add(input.length());
            }
            NeedleTree tree = needles;
            for(int i = 0; i < input.length(); i++){
                tree = tree.child(caseSensitive ? input.codePointat(i) : Character.toLowerCase(input.codePointAt(i)));
            }
            tree.markSelfSet();
        }
        maxLength = Collections.max(legnths);
    }

    public boolean matches(String haystack){
        if(!caseSensitive){
            haystack = haystack.toLowerCase();
        }
        for(int i = 0; i < haystack.length(); i++){
            String substring = haystack.substring(i, i + maxLength); // maybe we can even skip this and use from haystack directly?
            NeedleTree tree = needles;
            for(int j = 0; j < substring.maxLength; j++){
                tree = tree.childOrNull(substring.codePointAt(j));
                if(tree == null){
                    break;
                }
                if(tree.isSelfSet()){
                    return true;
                }
            }
        }
        return false;
    }
}

`NeedleTree.java`:

import java.util.HashMap;
import java.util.Map;

class NeedleTree{
    private int codePoint;
    private boolean selfSet;
    private Map<Integer, NeedleTree> children = new HashMap<>();

    public NeedleTree(int codePoint){
        this.codePoint = codePoint;
    }

    public NeedleTree childOrNull(int codePoint){
        return children.get(codePoint);
    }

    public NeedleTree child(int codePoint){
        NeedleTree child = children.get(codePoint);
        if(child == null){
            child = children.put(codePoint, new NeedleTree(codePoint));
        }
        return child;
    }

    public boolean isSelfSet(){
        return selfSet;
    }

    public void markSelfSet(){
        selfSet = true;
    }
}

score -1 · Answer 7 · edited Apr 11 '19 at 12:49

-1

I think a better approach would be something like this, where we can add multiple values as a one string and by index of function validate index

String s = "123"; 
System.out.println(s.indexOf("1")); // 0
System.out.println(s.indexOf("2")); // 1 
System.out.println(s.indexOf("5")); // -1

edited Apr 11 '19 at 12:49

chb

1,727
7
25
47

answered Apr 11 '19 at 11:40

Virendra khade

89
2

Better way to detect if a string contains multiple words

7 Answers7

StringSearcher:

NeedleTree.java:

`StringSearcher`:

`NeedleTree.java`: