36

I have an old piece of code that performs find and replace of tokens within a string.

It receives a map of from and to pairs, iterates over them and for each of those pairs, iterates over the target string, looks for the from using indexOf(), and replaces it with the value of to. It does all the work on a StringBuffer and eventually returns a String.

I replaced that code with this line: replaceAll("[,. ]*", "");
And I ran some comparative performance tests.
When comparing for 1,000,000 iterations, I got this:

Old Code: 1287ms
New Code: 4605ms

3 times longer!

I then tried replacing it with 3 calls to replace:
replace(",", "");
replace(".", "");
replace(" ", "");

This resulted with the following results:

Old Code: 1295
New Code: 3524

2 times longer!

Any idea why replace and replaceAll are so inefficient? Can I do something to make it faster?


Edit: Thanks for all the answers - the main problem was indeed that [,. ]* did not do what I wanted it to do. Changing it to be [,. ]+ almost equaled the performance of the non-Regex based solution. Using a pre-compiled regex helped, but was marginal. (It is a solution very applicable for my problem.

Test code:
Replace string with Regex: [,. ]*
Replace string with Regex: [,. ]+
Replace string with Regex: [,. ]+ and Pre-Compiled Pattern

RonK
  • 9,472
  • 8
  • 51
  • 87

5 Answers5

69

While using regular expressions imparts some performance impact, it should not be as terrible.

Note that using String.replaceAll() will compile the regular expression each time you call it.

You can avoid that by explicitly using a Pattern object:

Pattern p = Pattern.compile("[,. ]+");

// repeat only the following part:
String output = p.matcher(input).replaceAll("");

Note also that using + instead of * avoids replacing empty strings and therefore might also speed up the process.

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
  • @Lukas: it doesn't change *what* is done (in this case, where the replacement is the empty string). It *might* change the performance characteristics, but I don't know if leaving out `+` is faster or slower. – Joachim Sauer Jun 07 '11 at 08:27
  • 4
    @Joachim, interestingly, with a simple benchmark test on your solution, `[]+` seems the fastest (1x), `[]` in the middle (1.5x) , and `[]*` the slowest (2x). – Lukas Eder Jun 07 '11 at 08:33
  • This answer assumes that the map he receives is reused over and over again. – aioobe Jun 07 '11 at 08:34
  • @Joachim, []* matches the empty String "". So, for each character, a match is found of "" and replaced by "". This causes it to be slower. You can test it by doing ....replaceAll("$$$$") and watching the output. – SJuan76 Jun 07 '11 at 10:21
  • @SJuan76: I'm well aware of that (I already explained that in my answer). The only thing I didn't know was if `replaceAll("[,. ]+", "")` would be slower or faster than `replaceAll("[,. ]", "")`. – Joachim Sauer Jun 07 '11 at 10:41
  • @Joachim "some"? why are we now afraid to call out serious performance issues? Is it because the Knuth quoting brigade will pop-up with a useless essay on premature optimisation blah blah. The performance of replaceAll is terrible. If it is in the inner loop of some text processing code it can be *the* major overhead. – rghome Jul 24 '20 at 07:43
  • @rghome: I don't know why you think this is about fear. That sentence (written in 2011, by the way) is about the fact that you can certainly solve this with regexes with better performance than what OP was describing. And it turns out if you write a better regex, then performance gets much better indeed. – Joachim Sauer Jul 24 '20 at 08:59
  • @rghome: and as an off-topic comment: you complain about the "Knuth quoting brigade" and describe yourself as a "serial downvoter". This is not a war or a fight, we're all just trying to write better software. – Joachim Sauer Jul 24 '20 at 09:08
  • @Joachim Sauer Please don't take my comment too seriously. But Pattern.compile is really expensive if called repeatedly (I know, I am looking at a profiled system now), and I find there can be a tendency on SO to water-down performance issues. In the interests of better software, I don't think we need to continually qualify every performance answer. So I am just pushing back a bit against the climate. – rghome Jul 24 '20 at 10:32
10

replace and replaceAll uses regex internally which in most cases gives a serious performance impact compared to e.g., StringUtils.replace(..).

String.replaceAll():

public String replaceAll(String regex, String replacement) {
        return Pattern.compile(regex).matcher(this ).replaceAll(
             replacement);
}

String.replace() uses Pattern.compile underneath.

public String replace(CharSequence target, CharSequence replacement) {
  return Pattern.compile(target.toString(), Pattern.LITERAL)
         .matcher(this ).replaceAll(
           Matcher.quoteReplacement(replacement.toString()));
}

Also see Replace all occurrences of substring in a string - which is more efficient in Java?

Community
  • 1
  • 1
Johan Sjöberg
  • 47,929
  • 21
  • 130
  • 148
4

As I have put in a comment [,. ]* matches the empty String "". So, every "space" between characters matches the pattern. It is only noted in performance because you are replacing a lot of "" by "".

Try doing this:

Pattern p = Pattern.compile("[,. ]*");
System.out.println(p.matcher("Hello World").replaceAll("$$$");

It returns:

H$$$e$$$l$$$o$$$$$$W$$$o$$$r$$$l$$$d$$$!$$$

No wonder it is slower that doing it "by hand"! You should try with [,. ]+

SJuan76
  • 24,532
  • 6
  • 47
  • 87
1

When it comes to replaceAll("[,. ]*", "") it's not that big of a surprise since it relies on regular expressions. The regex engine creates an automaton which it runs over the input. Some overhead is expected.

The second approach (replace(",", "")...) also uses regular expressions internally. Here the given pattern is however compiled using Pattern.LITERAL so the regular expression overhead should be negligable.) In this case it is probably due to the fact that Strings are immutable (however small change you do, you will create a new string) and thus not as efficient as StringBuffers which manipulate the string in-place.

aioobe
  • 413,195
  • 112
  • 811
  • 826
0

This is a very old post, but for the record, since Java 9, the performance characteristics of String.replace(String,String) has changed, it now does not use Pattern under the hood

https://medium.com/javarevisited/micro-optimizations-in-java-string-replaceall-c6d0edf2ef6

chipukb
  • 213
  • 3
  • 7