19

I have a large array of strings that looks something like this: String temp[] = new String[200000].

I have another String, let's call it bigtext. What I need to do is iterate through each entry of temp, checking to see if that entry is found in bigtext and then do some work based on it. So, the skeletal code looks something like this:

for (int x = 0; x < temp.length; x++) {
  if (bigtext.indexOf(temp[x]) > -1) {
    // do some stuff
  }
}

Because there are so many entries in temp, and there are many instances of bigtext as well, I want to do this in the most efficient way. I am wondering whether what I've outlined is the most efficient way to do this search, or if there are better ways.

Thanks,

Elliott

amit
Elliott
  • that's 200 thousands Strings! Search engine, reverse index.... not sure if I am making sense though (: – Nishant Mar 06 '12 at 13:46
  • Hmm, if it's string comparison you could first check the length of the two strings and if equal then compare the strings starting from the last character if your strings are likely to be very similar but differ in the last few characters. – Aram Kocharyan Mar 06 '12 at 13:49
  • Are you looking for an algorithm? What measure of efficiency will you be using? – OlduwanSteve Mar 06 '12 at 13:52

8 Answers

15

I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.
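To make the suggestion concrete, here is a minimal sketch of Aho–Corasick: it builds a trie of all the patterns, links each node to its longest proper-suffix state, and then scans the text once, reporting every pattern that occurs. The class and method names are my own for illustration, not from the answer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Minimal Aho–Corasick: a trie of patterns plus failure links,
// giving a single O(|text| + total matches) scan for all patterns at once.
class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> output = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasick(Iterable<String> patterns) {
        // Phase 1: build the trie of all patterns.
        for (String p : patterns) {
            Node n = root;
            for (char c : p.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.output.add(p);
        }
        // Phase 2: BFS to set failure links (longest proper suffix in the trie).
        Queue<Node> q = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            q.add(child);
        }
        while (!q.isEmpty()) {
            Node cur = q.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                Node child = e.getValue();
                char c = e.getKey();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node cand = f.next.get(c);
                child.fail = (cand != null && cand != child) ? cand : root;
                child.output.addAll(child.fail.output); // inherit matches ending here
                q.add(child);
            }
        }
    }

    /** Returns every pattern that occurs somewhere in text (with repeats). */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) n = n.fail;
            n = n.next.getOrDefault(c, root);
            hits.addAll(n.output);
        }
        return hits;
    }
}
```

For the question's setup, you would build the automaton once from `temp` and call `findAll(bigtext)` (or a variant that fires a callback per match) for each instance of bigtext.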

Dave R
  • Rabin-Karp does not appear to be worthwhile, based on some experimental results done by another member and me. See http://stackoverflow.com/questions/9741188/java-indexof-function-more-efficient-than-rabin-karp-search-efficiency-of-text – Elliott Mar 17 '12 at 01:05
10

Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).

By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!

Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find each string, just walk the relevant path of the suffix tree. Since this is done n times, you get a total of O(|S1| + n*|S2|), which is asymptotically better than the naive implementation.
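As a rough illustration of the idea (using a suffix *array*, the simpler cousin a comment below mentions, rather than a full suffix tree): sort all suffix start positions of bigtext once, then binary-search each pattern. The naive construction below is O(|S1|² log |S1|), so it is a sketch only; a production build would use a linear or O(|S1| log |S1|) construction. All names are mine, not from the answer.

```java
import java.util.Arrays;

// Suffix-array sketch: sort the text's suffix start positions once,
// then answer each "does pattern occur?" query by binary search.
class SuffixSearch {
    private final String text;
    private final Integer[] sa; // suffix start indices, sorted lexicographically

    SuffixSearch(String text) {
        this.text = text;
        this.sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive comparator-based sort: fine for a demo, too slow for huge texts.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** True iff pattern occurs in the text: some sorted suffix starts with it. */
    boolean contains(String pattern) {
        int lo = 0, hi = sa.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(pattern)) return true;
            if (suffix.compareTo(pattern) < 0) lo = mid + 1; // range is to the right
            else hi = mid - 1;                               // range is to the left
        }
        return false;
    }
}
```

Each `contains` call is O(|S2| log |S1|) here; the suffix tree the answer describes gets that down to O(|S2|) per query.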

Peter Mortensen
amit
  • I think the current complexity is probably actually `O(|S1|*n*|S2|)`. At least, my JDK's `String.indexOf(String)` has complexity `O(|S1|*|S2|)`; I think one could be designed that converts the argument into an FSM and has lower asymptotic complexity, but in practice that's not done. – ruakh Mar 06 '12 at 18:33
  • Suffix array > suffix tree due to cache locality. But a trie will in many circumstances be faster. – Konrad Rudolph Mar 07 '12 at 00:29
8

If you have additional information about temp, you may be able to improve the iteration.

You can also reduce the time spent if you parallelize the iteration.
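A hypothetical sketch of the parallel version, assuming the work per entry is independent. It uses `java.util.stream` (Java 8+, which postdates this question); the class and method names are mine for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Parallel version of the original loop: scan temp on the common
// fork/join pool, keeping (or acting on) the entries found in bigtext.
class ParallelScan {
    static List<String> matches(String[] temp, String bigtext) {
        return Arrays.stream(temp)
                .parallel()                       // split work across cores
                .filter(bigtext::contains)        // same test as indexOf(...) > -1
                .collect(Collectors.toList());    // or do the per-match work here
    }
}
```

As the comments below note, whether this pays off depends on the available cores and on whether "do some stuff" is safe to run concurrently.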

Peter Mortensen
Hachi
  • unless of course the `do some stuff` is dependent on previous iterations. If it isn't - parallelizing is definitely a great idea. – amit Mar 06 '12 at 13:51
  • 1
    Parallelization could be a benefit, but we really don't know if they even have the extra cores to take advantage of it. There is a cost in coordinating the threads, and if there is only one effective core available, parallelization may be a costly introduction of overhead. – Edwin Buck Mar 06 '12 at 13:56
  • parallelization on 200 000 elements can improve the speed even in a single core environment; anyway it's worth a try – Hachi Mar 06 '12 at 14:00
  • 1
    I agree with @Hachi, even with one core, having multiple threads running - every time there is pagefault, and a page needs to be reloaded from disk to memory - with multi threaded system - other threads just keep working while waiting for the disk, while with single-threaded, the program just suspends and waits for the page to be loaded back to memory. – amit Mar 06 '12 at 14:04
  • 1
    @amit, you are correct, but perhaps you are not considering that with a bunch of threads accessing different parts of memory, you are likely to introduce more pagefaults than a single thread which is written to be cache-friendly. – Edwin Buck Mar 06 '12 at 17:02
5

Efficiency depends heavily on what is valuable to you.

Are you willing to increase memory for reduced time? Are you willing to increase time for efficient handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section?

In your posting, you indicate that the entire portion is what you want made more efficient, but that means you have excluded any portion of the code or system where the trade-off could be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect, depending on one's point of view.

Edwin Buck
3

An alternative approach would be to tokenize the text - let's say split on common punctuation. Then put these tokens into a Set, and find the intersection with the main container.

Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing

bidTextSet.retainAll(mainWordsSet);

What remains will be the words that occur in bigText that are in your "dictionary".
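A self-contained sketch of this approach (the split pattern, lowercasing, and names are my assumptions, not from the answer). Note that it finds whole words only, not arbitrary substrings, so it changes the problem slightly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Word-level intersection: tokenize bigText on whitespace/punctuation,
// then keep only the tokens that also appear in the dictionary set.
class WordIntersection {
    static Set<String> commonWords(String bigText, Set<String> dictionary) {
        // Assumes the dictionary entries are lowercase single words.
        Set<String> bigTextSet = new HashSet<>(
                Arrays.asList(bigText.toLowerCase().split("[\\s\\p{Punct}]+")));
        bigTextSet.retainAll(dictionary); // the retainAll step from the answer
        return bigTextSet;
    }
}
```

Both the tokenization and the set intersection are linear in the input sizes, versus the quadratic-ish repeated indexOf scans in the original loop.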

Nim
3

Use a search algorithm like Boyer–Moore. Searching for "Boyer–Moore" turns up lots of links that explain how it works, including a Java example.

Peter Mortensen
Orlymee
  • Boyer-Moore is *not* an algorithm that is suitable for multi-pattern search. And in particular if the patterns are short, this will be *slower* than the naive search. – Konrad Rudolph Mar 07 '12 at 00:28
2

I'm afraid it's not efficient at all in any case!

To pick the right algorithm, you need to provide some answers:

  1. What can be computed offline? That is, is bigText known in advance? I guess temp is not, judging from its name.
  2. Are you actually searching for words? If so, index them. A Bloom filter can help, too.
  3. If you need a bit of fuzziness, maybe stemming or Soundex can do the job?

Sticking to a strict inclusion test, you could build a trie from your temp array. It would prevent searching for the same sub-string several times.
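The trie idea could be sketched as follows (names are mine, not from the answer): insert every temp entry into a trie, then walk the trie from each position of bigText, so shared prefixes among the 200,000 entries are matched only once per starting position. Adding failure links would turn this into Aho–Corasick and remove the per-position restart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A plain trie of patterns, scanned from every starting position of the text.
class PatternTrie {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        String word; // non-null iff a pattern ends at this node
    }

    private final Node root = new Node();

    void insert(String pattern) {
        Node n = root;
        for (char c : pattern.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.word = pattern;
    }

    /** Collects every inserted pattern that occurs somewhere in text. */
    Set<String> scan(String text) {
        Set<String> found = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            Node n = root;
            for (int j = i; j < text.length(); j++) {
                n = n.next.get(text.charAt(j));
                if (n == null) break;         // no pattern continues this way
                if (n.word != null) found.add(n.word);
            }
        }
        return found;
    }
}
```

Worst case this is still O(|S1| * |longest pattern|), but in practice the inner loop stops as soon as no pattern shares the current prefix.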

Peter Mortensen
YvesgereY
  • 1
    Who the f*** downvoted the best answer so far? (Except for the first line … that is really just misleading since these methods can be up to several orders of magnitude faster than the naive approach). – Konrad Rudolph Mar 07 '12 at 00:30
1

That is a very efficient approach. You can improve it slightly by only evaluating temp.length once:

for(int x = 0, len = temp.length; x < len; x++)

Although you don't provide enough detail about your program, it's quite possible you could find a more efficient approach by redesigning it.

Tomasz Nurkiewicz
Johan Sjöberg
  • are you sure this is an improvement in java? i don't know for arrays, but Java iterates faster over ArrayLists e.g. if the condition is x < list.size() – Hachi Mar 06 '12 at 13:47
  • 3
    I doubt it will have any effect, temp.length is a just field, it is accessed immediately - and the jvm will probably optimize the double referencing anyway [only reason for improvement I can think of] – amit Mar 06 '12 at 13:48
  • 1
    @Hachi, `List` objects are different, there `size()` runs in constant time `O(1)` while `array.length` runs in `O(n)`. Then it's possible a clever compiler can perform some optimizations .. – Johan Sjöberg Mar 06 '12 at 13:48
  • 1
    Seems it has been discussed before: http://stackoverflow.com/questions/6626856/performance-wise-is-it-better-to-call-the-length-of-a-given-array-every-time-or and http://stackoverflow.com/questions/8458187/is-there-a-difference-in-performance-for-calling-length-on-an-array-versus-savi – assylias Mar 06 '12 at 13:51
  • 3
    @Johan why should `array.length` run in O(n)? it's a final integer attribute, just like `list.size()` is a getter on a normal integer attribute – Hachi Mar 06 '12 at 13:54
  • @Hachie, that's what I was taught long ago when I first learned the language. This seems like a good time to reconsider if those teaching were plain wrong ... – Johan Sjöberg Mar 06 '12 at 13:58
  • `array.length` is most certainly not O(n). Arrays in Java have a fixed length that is set at the time they're created. `size()` on a `List` implementation, on the other hand, is not guaranteed to run in constant time... though I don't know of any implementations of `List` where it doesn't. – ColinD Mar 06 '12 at 13:59
  • @ColinD, ofc the `Array` source code isn't that easy to come by, but I believe you are right – Johan Sjöberg Mar 06 '12 at 14:03
  • @JohanSjöberg: As amit mentioned, `length` on an array is a field, not a method. Field access is constant time, where method run time is dependent on the method implementation. For an apparent field access to not have the behavior characteristics of a field access would be very surprising, and Java makes a point of avoiding surprising behavior. – ColinD Mar 06 '12 at 14:31