6

My pattern is OR-like: "word1|word2|word3". I have approximately 800 words.

Can this be a problem?
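For reference, a pattern of this shape could be built and used for counting roughly as follows. This is only an illustrative sketch, not the asker's actual code: the class name `AlternationCountSketch`, the method names, and the surrounding `\b` word boundaries are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds one big alternation and counts its matches.
class AlternationCountSketch {

    // Produces \b(?:word1|word2|...)\b; each word is quoted so that
    // regex metacharacters inside a word are treated literally.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Counts how often each listed word occurs as a whole word in the text.
    static Map<String, Integer> count(List<String> words, String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher matcher = buildPattern(words).matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("word1", "word2", "word3");
        System.out.println(count(words, "word1 word3 word1 other"));
    }
}
```

With 800 words the same code applies unchanged; only the list grows.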

Andreas Dolk
Johnny

3 Answers

6

You're only limited by memory and sanity. :)

vipw
4

You might consider using the Aho–Corasick string-searching algorithm. It should be much more efficient than a regex here, since it scans the text in linear time and is designed for exactly this problem: matching many keywords at once. It's also a way to pay respect to our fellows from 1975!

In particular, there is this Java implementation.
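To give an idea of what the algorithm does, here is a minimal sketch of it, written from scratch and not taken from the linked implementation. The class name `AhoCorasickSketch` and its methods are made up for illustration. It assumes the keywords are distinct and non-empty, and it counts substring occurrences, so "he" is also found inside "she".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified Aho-Corasick automaton for counting keyword hits.
class AhoCorasickSketch {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                        // fallback node on a mismatch
        final List<String> outputs = new ArrayList<>();   // keywords ending here
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> keywords) {
        // Phase 1: insert every keyword into a trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the fallback node.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and counts every keyword occurrence.
    Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (String keyword : node.outputs) {
                counts.merge(keyword, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The automaton is built once from the word list and can then be reused for any number of texts, for example `new AhoCorasickSketch(words).count(corpus)`. If only whole-word matches are wanted, the hits would still need a boundary check, which this sketch leaves out.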

Remi Mélisson
3

Why should it be? No, probably not.

A regexp with 800 words indicates a design problem somewhere, I would say. Why do you need 800 words, and what for?

Miki
  • +1 for design problem. I'd suggest that the OP edit this question to provide more detail. – Liviu T. Jun 15 '11 at 14:39
  • I am in fact counting the occurrences of a list of 800 words in a corpus. Is there a better way to do it than with a regex? – Johnny Jun 19 '11 at 12:02
  • Yes, using `HashMap`. Split the text on non-words (regex `\W` if I am not mistaken), which will give you an array of `String`s. Go through each element of that array. If the hash map already contains the word, increase its value by 1; if not, insert the word as a key with 1 as its value. This should be more efficient, but that is only my opinion. – Miki Jun 19 '11 at 13:35
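The `HashMap` approach from the last comment could look roughly like the sketch below. The class name `WordCountSketch` and the `wanted` set are assumptions: the comment counts every token, while this version keeps only the words from the list, since that is what the question asks for. It splits on `\W+` rather than `\W` so that runs of separators do not produce empty tokens.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: counts occurrences of the wanted words in a text.
class WordCountSketch {

    static Map<String, Integer> count(String text, Set<String> wanted) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of non-word characters to get the individual tokens.
        for (String token : text.split("\\W+")) {
            if (wanted.contains(token)) {
                // Insert 1 for a new word, otherwise add 1 to the stored value.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, `WordCountSketch.count(corpus, new HashSet<>(words))` makes one pass over the text with a constant-time lookup per token. Note that Java's `\W` is ASCII-only by default, so accented words would be split unless the pattern is compiled with `Pattern.UNICODE_CHARACTER_CLASS`.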