2

I'm working on a project in which I must search for a little string (about 40 chars) in a very very large string (we're speaking of about a hundred million chars). I'm searching the fastest way. I've tried several methods: these are the results of benchmarks:

  • Contains returned True in 248 ms;
  • IndexOf returned True in 671 ms (I would have never said that!);
  • Contains using an array instead of a string returned True in 48 ms only;

Even though the Contains inside an array seemed to be the best method, I've had a look at some search algoritms too (Knuth–Morris–Pratt, Rabin-Karp and Boyer-Moore), but none of them seems to be suitable for my scenario

My question is: is there a faster way to search a little string in a very big string?

Thanks,

PWhite

PWhite
  • 141
  • 1
  • 12
  • 1
    This just begs the question, how are you getting the very large string in the first place?... – Idle_Mind Dec 09 '14 at 20:36
  • 2
    How does the "Contains using an array" case look? – Magnus Dec 09 '14 at 20:39
  • 1
    [Contains just call IndexOf](http://stackoverflow.com/a/498722/130611), your result seems stange. Are you running in multi thread? – the_lotus Dec 09 '14 at 20:39
  • 1
    @the_lotus `Contains` calls `IndexOf` with the `StringComparison.Ordinal` comparison type, that is probably why it is faster. – Magnus Dec 09 '14 at 20:42
  • 2
    Trying to optimise this suggests that you will be doing it more than just once. Will you be searching the same huge string for more than one word? Is there anything more that you can tell about what the huge string is, how it is created, or why you are searching it, that can help in finding a way to search other than just brute force? – Guffa Dec 09 '14 at 20:46
  • Hi all, thanks for your answers. @Guffa: I'm searching a little string in a huge textual database of clothes and, yes, I'm doing this more than once subsequently; @Magnus: I read the file with `File.ReadAllLines` instead of `File.ReadAllText`. My question is: is there a faster method that `Contains` (with `ReadAllLines`) or shall I use it? Thanks! _FWhite_ – PWhite Dec 10 '14 at 15:32

1 Answers1

0

If you have a fixed, gigantic string that you want to search repeatedly for very small substrings, you may want to search for a library that implements a suffix tree or suffix array. Once you've done the preprocessing work to build the suffix tree or array, the runtime of searching for a pattern of length P is O(P) for the suffix tree and O(P + log T) for the suffix tree, where T is the length of the long text string. That's likely to be significantly faster than what you're seeing now, though you'll need heavierweight libraries to do this.

On the other hand, if you have a fixed set of pattern strings and a rotating cast of large strings to search in, you may want to use the Aho-Corasick string matching algorithm, which can scan for all occurrences of a fixed set of patterns in a string of length T in time O(T + z), where z is the number of matches. This is typically very fast in practice.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065