4

If I need to check if for example a word A or word B exists in a text (String), is there a performance difference if I do:

if(text.contains(wordA) || text.contains(wordB))

to using some regular expression that searches the string?
Does it depend on the regular expression format?
Or is it just a matter of taste?

UPDATE:
If text.contains(wordA) is false then the text.contains(wordB) will be evaluated.
This means that contains will be called twice.

I was thinking if in performance terms a regex might be better than calling contains twice.

Cratylus
  • 52,998
  • 69
  • 209
  • 339
  • If your data comes from a disk, the I/O overhead is going to completely dominate your measurements, to the point where the cost of compiling a regex *and* scanning the buffer twice will be reduced into the measurement error margin. – tripleee Jan 20 '12 at 08:48

5 Answers5

4

The code you have expresses your intent clearly, is more readable than a regexp, and is also probably faster.

Anyway, there is a very low probability that this part of your code causes any significant performance problem. So I wouldn't worry about performance here, but about readability and maintainability.

JB Nizet
  • 678,734
  • 91
  • 1,224
  • 1,255
  • I understand what you are saying.My concern is that the `contains(a)` or `contains(b)` will parse twice and could be that a regular expression could do it in a single parse for instance.Do you think that this could be an issue? – Cratylus Jan 20 '12 at 07:37
  • What do you mean by parse twice? if text.contains(wordA) returns true, the second expression will never be evaluated.. – quaylar Jan 20 '12 at 07:41
  • @quaylar:If the `text.contains(wordA)` is `false` the second will be evaluated.I thought that with a regex it would be faster since I assume than calling `contains` twice. – Cratylus Jan 21 '12 at 08:53
4

While the performance of regular expression is lower, it has more expressive power and often this is more important. For example.

 "performance".contains("form") // is true

this may not be wheat you intended by a "word" Instead you can have a pattern

 "\\bform\\b"

This will only match a complete word in a string which can be at the start or the end.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
3

Yes their is a difference. Contains does various array manipulation to find the words, a regex uses diffent logic so it will be different, performance will even change depending how you uses the regular expression matching.

Will it be significant ? thats hard to tell. But the best thing you should realise:

First write your code and dont bother with questioning performance until you run into problems, after profiling clearly indicates that this test is the issue.

I would just use the contains method. But thats an opinion without actually testing anything.

Peter
  • 5,728
  • 20
  • 23
2

With this trivial example you shouldn't see much of a performance difference, but purely from the algorithms involved the regular expression

wordA|wordB

would indeed be faster, as it just makes a single pass through the string and employs a finite automaton to match one of the two substrings. However, this is offset by building the finite automaton first, which should be pretty much linear in the length of the regex in this case. You can compile the regex first to have that cost only once as long as the compiled object lives.

So essentially cost comes down to:

  • linear search through the string twice (2 · string length)
  • or linear search through the string once and building the DFA (string length + regex length)

if your text is very large and the substrings very small, then this could be worthwhile.

Still, you're optimising the wrong place, most likely. Use a profiler to find the actual bottlenecks in your code and optimise those; don't ever worry about such trivial “optimisations” unless you can prove them to make an impact.

One final thing to consider, though: With a regex you could make sure you're actually matching words (or things that look like words) instead of word parts, which might be an actual reason to consider regex instead of contains.

Joey
  • 344,408
  • 85
  • 689
  • 683
2

In my opinion its a matter of taste. Avoid doing premature optimization, see Practical rules for premature optimization.

  1. As a general rule, if you are looking for words substrings and not patterns, then don't use regular expressions.

  2. There will be only a minor performance difference for such a simple regex against the text search, so if you do this search only once in a while its not a performance issue. If you do it for some thousand times or more, in a loop, then make a benchmark, if you have performance problems

Community
  • 1
  • 1
stema
  • 90,351
  • 20
  • 107
  • 135
  • stema:Concerning rule 1, is it a standard approach or your suggestion?I am somewhat interested in your rule 1 – Cratylus Jan 20 '12 at 07:43
  • I would say its a standard approach, regular expressions are a excellent, powerful tool for **pattern matching** and I wouldn't consider a fixed word a pattern. As soon as this word has variations or has to be in a certain place, it looks different. – stema Jan 20 '12 at 07:47
  • So you suggest this only if I do search for `wordA` as it is.If I would consider a match also `someword-wordA` your suggestion would still apply? – Cratylus Jan 20 '12 at 07:50
  • @user384706 I changed a word in my rule 1. Peter Lawrey's answer reminded me that a word is already a pattern, if this word is contained in another word, it should normally not be found (depend on your requirements) – stema Jan 20 '12 at 07:54