Check if a string is available in a bigger string of length 100,000

Question

Possible Duplicate:
What is the fastest substring search algorithm?

How do I check if a string is present in a bigger string of length of 100,000 characters in C++ or Java?

I know a method str.find("sub_string"); but it can't handle such a big string. The max execution time is 1 sec.

Also the sub strings I need to look for can be 50,000!

Which do you want? C++ or Java or both? The answers will be different for each. — Code-Apprentice, Sep 21 '12 at 18:30
There is no way you can expect any algorithm to perform in under one second to do this. Execution time must be somewhere in O(n) to O(n^2), or worse. — Drise, Sep 21 '12 at 18:30
Wondering where do you find a string of 100,000 length and of what use? — A Developer, Sep 21 '12 at 18:31
Why is this an SO question? Just write code that does what you need using whatever [string search algorithm](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) you like. As asked, this is way too vague to answer because it doesn't give us nearly enough information to choose the appropriate algorithm. (Does the substring change from search to search? Does the string? Is the string evenly distributed characters? Does the substring contain rare inner substrings? And so on.) — David Schwartz, Sep 21 '12 at 18:31
@Drise It's definitely `O(n)`. But 100,000 should be doable in under 1 sec regardless. — Mysticial, Sep 21 '12 at 18:34
@Mysticial: Actually (using one of the Boyer-Moore variants, for example) this can be sub-linear (e.g., in best case, the number of comparisons is N/M, where N is length of haystack and M is length of needle). — Jerry Coffin, Sep 21 '12 at 18:37
also my test cases can be 50,000!! what should i do in such case? — Rajat, Sep 21 '12 at 19:01

David Schwartz · Answer 1 · 2012-09-21T18:47:04.143

5

In C or C++, you can just use malloc to get a chunk of 100,000 bytes. Fill it with your data. To find a needle in a haystack, you can use the following code:

void *mem_mem(void *haystack, int haystack_len, void *needle, int needle_len)
{
  const char *begin;
  const char *const last_possible
    = (const char *) haystack + haystack_len - needle_len;

  if (needle_len == 0)
    return (void *) &((const char *) haystack)[needle_len - 1];

  for (begin = (const char *) haystack; begin <= last_possible; ++begin)
    if (begin[0] == ((const char *) needle)[0] &&
    !memcmp ((const void *) &begin[1],
         (const void *) ((const char *) needle + 1),
         needle_len - 1))
      return (void *) begin;

  return NULL;
}

On any reasonably modern platform, this will find any substring in 100,000 bytes in a tiny fraction of a second. You can modify it to use char * types trivially. If you do multiple searches in the same haystack, try to only compute the haystack length once. Don't call strlen when you don't need to.

This will be horribly sub-optimal if your haystack contains many repetitions of the first character or characters of your needle. For example, searching for "ab" in "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaqaaaa.." (or worse, "abc" in "abababababababab...abc...") will be slow. But you didn't give nearly enough details for us to determine the optimum implementation.

It's entirely possible that the point of the question is to write the algorithm with the best possible worst case performance. If so, this is probably not the "right" answer. One can imagine a haystack of all a's followed by a single b at the end and a needle consisting of all a's followed by a single b at the end. In that case, this algorithm might need a very long time.

edited Sep 21 '12 at 18:47

answered Sep 21 '12 at 18:35

David Schwartz

179,497
17
214
278

1

For one terrible question, I think this is the most reasonable answer, especially the repeated characters bit. – Drise Sep 21 '12 at 18:39
2

I suspect an online judge, so the test cases would probably include something like `needle = a^9999b, haystack=a^100000` which would not be doable within the time limit with the naive algorithm. – Daniel Fischer Sep 21 '12 at 18:40
1

@DanielFischer Good point. It's entirely possible that the point of the question is to write the algorithm with the best possible worst case performance. If so, this is probably not the "right" answer. – David Schwartz Sep 21 '12 at 18:41
You should incorporate [your comment](http://stackoverflow.com/questions/12536157/check-if-a-string-is-available-in-a-bigger-string-of-length-100-000#comment16879474_12536157) into the answer to cover that case too :D – Daniel Fischer Sep 21 '12 at 18:45
@DanielFischer: You're right. Done. – David Schwartz Sep 21 '12 at 18:47
also my test cases can be 50,000!! what should i do in such case? – Rajat Sep 21 '12 at 19:03
@r20rock: Can be 50,000 what? – David Schwartz Sep 21 '12 at 19:31

score 4 · Accepted Answer · answered Sep 21 '12 at 18:33

This completes almost instantly (4ms) on my modest 1st gen Intel iMac. I put the search string between two blocks of 100,000 characters in case java searches backwards.

StringBuilder builder = new StringBuilder();
for (int i = 0; i < 100000; i++) {
    builder.append((char) i);
}
builder.append("sub_string");
for (int i = 0; i < 100000; i++) {
    builder.append((char) i);
}
String maxString = builder.toString();
long t1 = System.currentTimeMillis();
System.out.println(maxString.contains("sub_string"));
long t2 = System.currentTimeMillis();
System.out.println(t2 - t1);

Output

true
4

score 1 · Answer 3 · answered Sep 21 '12 at 18:29

1

Assuming java:

String.contains("stringtofind");

Is one way to find if string exists with in another string, javadoc.

answered Sep 21 '12 at 18:29

kosa

65,990
13
130
167

2

@r20rock: I didn't test it with big string. But, javadoc didn't specify any limits, so it should work. – kosa Sep 21 '12 at 18:33
1

Adding more ? does not help you get answers faster or better. Just makes me want to troll you hard. – Drise Sep 21 '12 at 18:34
1

keep in mind that..string.contains() is case sensitive check – A Developer Sep 21 '12 at 18:34

Subhrajyoti Majumder · Answer 4 · 2012-09-21T18:37:17.103

1

In java three way to find String content.

String.contains("charSequence");
String.contentEquals("charSequence");
String.contentEquals("StringBuffer");

And you are be able to get max a String of length Integer.MAX_VALUE (always 2147483647 (2^31 - 1)) by the Java specification.

edited Sep 21 '12 at 18:37

answered Sep 21 '12 at 18:31

Subhrajyoti Majumder

40,646
13
77
103

Check if a string is available in a bigger string of length 100,000

4 Answers4