2

I have a service that processes a lot of messages (Those messages are at most 100 characters).

One of the tasks to do is detect if a message contains a string, that is always the same. Which one of the following things could be faster? Regex, Precompiled Regex, IndexOf, contains or any other?

colymore
  • 11,776
  • 13
  • 48
  • 90

3 Answers3

0

Just use contains: if you see the message for the first time, there is essentially nothing else you could do, because you would have to look at the content of the message at least once. But while you are looking, you can simultaneously check whether it contains the special keyword.

What you could (and probably should) do is to process multiple messages in parallel, since your task seems to be embarrassingly parallel in the number of messages.

Andrey Tyukin
  • 43,673
  • 4
  • 57
  • 93
0

I have a different theory, since you are referring to a service which processes a lot of messages and the messages are long, I would suggest you use the Pattern, Matcher since it would be the correct way to find texts with regular expressions. In practice (using large texts) which in your case, it will be the most efficient way. This is because a constant pattern (like "ho") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast). Also, if you are implementing a service that just does process messages based on some patterns, you must keep it flexible so that it allows expandable search patterns, rather then being fixed - i.e, the services get's the patterns from the config and applies it based on a criteria.

This guide shows how to implement the search patterns, apply the ones that fit your needs. It has very good examples on how to use Quantifiers, Boundary Matchers etc. Here is another helpful links that point to blogs that focus on regex and String.matches performance.

Vikram Palakurthi
  • 2,406
  • 1
  • 27
  • 30
  • But messages are not larges, size would be between 10 chars and 100. Does it still apply? – colymore May 21 '18 at 15:16
  • Let's say messages that are up to length 40 (just a number I am using, this is not a benchmark) can be done through indexOf but as it goes up it's better to do it through pattern compile and matchers. But if you are way sure that the messages won't go over 100 you could still be good with indexOf but since your search service could evolve and the services process a lot of them you are better of with the pattern and matchers. – Vikram Palakurthi May 21 '18 at 15:24
  • is the message length fixed 100? will it grow beyond that? what if it grow upto 2000? can you reuse this search services to process longer messages. You could also categorize if you have use indexOf or Patterns by checking length. If you are implementing this as a standard in your company being flexible is always good. – Vikram Palakurthi May 21 '18 at 15:27
0

If you search for exactly one string in multiple texts than prefer to use a string search algorithm. One is implicitly use if you search for a constant pattern with the jdk java.util.regex.Pattern. There are faster algorithms and their performance differs on:

  • the size of the alphabet
  • the size of the pattern

If you search for multiple strings there is no alternative in the jdk (do not use java.util.regex because it cannot search efficiently for multiple strings). Multi-String-Algorithms differ in performance on

  • the size of the alphabet
  • the size of the pattern
  • the number of searched patterns

You can find an overview of single string/multi string algorithms at StringSearchAlgorithms.

CoronA
  • 7,717
  • 2
  • 26
  • 53