-1

Possible Duplicate:
What is the fastest substring search algorithm?

How do I check if a string is present in a bigger string of length of 100,000 characters in C++ or Java?

I know a method str.find("sub_string"); but it can't handle such a big string. The max execution time is 1 sec.

Also the sub strings I need to look for can be 50,000!

Community
  • 1
  • 1
Rajat
  • 313
  • 1
  • 8
  • 19
  • 3
    Which do you want? C++ or Java or both? The answers will be different for each. – Code-Apprentice Sep 21 '12 at 18:30
  • 1
    There is no way you can expect any algorithm to perform in under one second to do this. Execution time must be somewhere in O(n) to O(n^2), or worse. – Drise Sep 21 '12 at 18:30
  • Wondering where do you find a string of 100,000 length and of what use? – A Developer Sep 21 '12 at 18:31
  • 4
    Why is this an SO question? Just write code that does what you need using whatever [string search algorithm](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) you like. As asked, this is way too vague to answer because it doesn't give us nearly enough information to choose the appropriate algorithm. (Does the substring change from search to search? Does the string? Is the string evenly distributed characters? Does the substring contain rare inner substrings? And so on.) – David Schwartz Sep 21 '12 at 18:31
  • 2
    c++ or java...any of them will work... – Rajat Sep 21 '12 at 18:31
  • 3
    @Drise It's definitely `O(n)`. But 100,000 should be doable in under 1 sec regardless. – Mysticial Sep 21 '12 at 18:34
  • Dear god, char*. Are you kidding? – Drise Sep 21 '12 at 18:35
  • 1
    I did some timings, see my answer, 4ms... – Adam Sep 21 '12 at 18:37
  • @Mysticial: Actually (using one of the Boyer-Moore variants, for example) this can be sub-linear (e.g., in best case, the number of comparisons is N/M, where N is length of haystack and M is length of needle). – Jerry Coffin Sep 21 '12 at 18:37
  • also my test cases can be 50,000!! what should i do in such case? – Rajat Sep 21 '12 at 19:01

4 Answers4

5

In C or C++, you can just use malloc to get a chunk of 100,000 bytes. Fill it with your data. To find a needle in a haystack, you can use the following code:

void *mem_mem(void *haystack, int haystack_len, void *needle, int needle_len)
{
  const char *begin;
  const char *const last_possible
    = (const char *) haystack + haystack_len - needle_len;

  if (needle_len == 0)
    return (void *) &((const char *) haystack)[needle_len - 1];

  for (begin = (const char *) haystack; begin <= last_possible; ++begin)
    if (begin[0] == ((const char *) needle)[0] &&
    !memcmp ((const void *) &begin[1],
         (const void *) ((const char *) needle + 1),
         needle_len - 1))
      return (void *) begin;

  return NULL;
}

On any reasonably modern platform, this will find any substring in 100,000 bytes in a tiny fraction of a second. You can modify it to use char * types trivially. If you do multiple searches in the same haystack, try to only compute the haystack length once. Don't call strlen when you don't need to.

This will be horribly sub-optimal if your haystack contains many repetitions of the first character or characters of your needle. For example, searching for "ab" in "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaqaaaa.." (or worse, "abc" in "abababababababab...abc...") will be slow. But you didn't give nearly enough details for us to determine the optimum implementation.

It's entirely possible that the point of the question is to write the algorithm with the best possible worst case performance. If so, this is probably not the "right" answer. One can imagine a haystack of all a's followed by a single b at the end and a needle consisting of all a's followed by a single b at the end. In that case, this algorithm might need a very long time.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • 1
    For one terrible question, I think this is the most reasonable answer, especially the repeated characters bit. – Drise Sep 21 '12 at 18:39
  • 2
    I suspect an online judge, so the test cases would probably include something like `needle = a^9999b, haystack=a^100000` which would not be doable within the time limit with the naive algorithm. – Daniel Fischer Sep 21 '12 at 18:40
  • 1
    @DanielFischer Good point. It's entirely possible that the point of the question is to write the algorithm with the best possible worst case performance. If so, this is probably not the "right" answer. – David Schwartz Sep 21 '12 at 18:41
  • You should incorporate [your comment](http://stackoverflow.com/questions/12536157/check-if-a-string-is-available-in-a-bigger-string-of-length-100-000#comment16879474_12536157) into the answer to cover that case too :D – Daniel Fischer Sep 21 '12 at 18:45
  • @DanielFischer: You're right. Done. – David Schwartz Sep 21 '12 at 18:47
  • also my test cases can be 50,000!! what should i do in such case? – Rajat Sep 21 '12 at 19:03
  • @r20rock: Can be 50,000 what? – David Schwartz Sep 21 '12 at 19:31
4

This completes almost instantly (4ms) on my modest 1st gen Intel iMac. I put the search string between two blocks of 100,000 characters in case java searches backwards.

StringBuilder builder = new StringBuilder();
for (int i = 0; i < 100000; i++) {
    builder.append((char) i);
}
builder.append("sub_string");
for (int i = 0; i < 100000; i++) {
    builder.append((char) i);
}
String maxString = builder.toString();
long t1 = System.currentTimeMillis();
System.out.println(maxString.contains("sub_string"));
long t2 = System.currentTimeMillis();
System.out.println(t2 - t1);

Output

true
4
Adam
  • 35,919
  • 9
  • 100
  • 137
1

Assuming java:

String.contains("stringtofind");

Is one way to find if string exists with in another string, javadoc.

kosa
  • 65,990
  • 13
  • 130
  • 167
1

In java three way to find String content.

String.contains("charSequence");
String.contentEquals("charSequence");
String.contentEquals("StringBuffer"); 

And you are be able to get max a String of length Integer.MAX_VALUE (always 2147483647 (2^31 - 1)) by the Java specification.

Subhrajyoti Majumder
  • 40,646
  • 13
  • 77
  • 103