7

I would like to ask if there is an algorithm for counting the number of discrete occurencies of a substring in a string in O(n) time.

Brian Webster
  • 30,033
  • 48
  • 152
  • 225
GioTse
  • 71
  • 1
  • 2
  • @alex: it's a question about the *existence* of an algorithm. OK, so nothing is truly Turing-complete due to practical limits, but I'm pretty sure that an algorithm exists in C if and only if an algorithm exists in Lisp, Javascript, or any other semi-decent programming language. – Steve Jessop Jun 19 '11 at 11:44
  • @Steve Sorry, I assumed the OP wanted example code and not just an algorithm. – alex Jun 19 '11 at 11:45
  • @alex: well, you might be right and they do want code. Maybe I'm being too literal in only considering what they actually asked for :-) – Steve Jessop Jun 19 '11 at 11:47
  • what is n in O(n)? Do you have one substring and one string or something else? – Luka Rahne Jun 19 '11 at 12:17
  • simple linear search for for substring in the string can also be good i guess....???correct me if i m wrong.... – Abhimanyu Srivastava Jun 19 '11 at 12:44

2 Answers2

10

[EDIT 17/11/2013: Count the leaf nodes. Thanks Vinicius Pinto!]

You can build a suffix tree on the text in linear time. Then, search for your substring in the suffix tree; when you find it, count the number of leaf nodes beneath the matching node. This is O(m + k) for a substring of length m that appears k times (add an n term for building the suffix tree). Or, you can precalculate the number of descendants for each node in the tree using a depth-first traversal -- this will give O(m) queries.

For large texts, suffix arrays are often faster than suffix trees in practice, despite searching for a single length-m string dropping from O(m) to O(m log m). In this case, all occurrences of a particular substring will appear as a contiguous block in the suffix array, so the width of this block is the number of occurrences.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
4

You can use the KMP Algorithm and modify it to increment a counter instead of returning.

Another possibility is the Rabin-Karp algorithm, however this relies on hashing, so you either have to accept the possibility of false positives while keeping the complexity linear, or accept the possibility of quadratic complexity while keeping the results 100% correct.

IVlad
  • 43,099
  • 13
  • 111
  • 179