117

I need to replace many different sub-string in a string in the most efficient way. is there another way other then the brute force way of replacing each field using string.replace ?

Brian Agnew
  • 268,207
  • 37
  • 334
  • 440
Yossale
  • 14,165
  • 22
  • 82
  • 109

11 Answers11

114

If the string you are operating on is very long, or you are operating on many strings, then it could be worthwhile using a java.util.regex.Matcher (this requires time up-front to compile, so it won't be efficient if your input is very small or your search pattern changes frequently).

Below is a full example, based on a list of tokens taken from a map. (Uses StringUtils from Apache Commons Lang).

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");

String template = "%cat% really needs some %beverage%.";

// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);

StringBuffer sb = new StringBuffer();
while(matcher.find()) {
    matcher.appendReplacement(sb, tokens.get(matcher.group(1)));
}
matcher.appendTail(sb);

System.out.println(sb.toString());

Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking then you would still need to benchmark in order to confirm this!)

Todd Owen
  • 15,650
  • 7
  • 54
  • 52
  • 1
    Yes, needs to be benchmarked for the number of iterations though. – techzen Aug 25 '09 at 09:00
  • 6
    I think you should escape special characters in each token before doing `"%(" + StringUtils.join(tokens.keySet(), "|") + ")%";` – Developer Marius Žilėnas Mar 19 '15 at 08:59
  • Note that one can use StringBuilder for a bit more speed. StringBuilder isn't synchronized. [edit] whoops only works with java 9 though – Tinus Tate Apr 26 '18 at 18:34
  • 3
    Future reader : for regex, "(" and ")" will enclose the group to search. The "%" counts as a literal in the text. If your terms don't start AND end with the "%" they will not be found. So adjust prefixes and suffixes on both parts (text + code). – linuxunil May 24 '19 at 17:53
96

Algorithm

One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), fast hashing algorithm, and efficient collections implementation.

Simple Code

A simple solution leverages Apache's StringUtils.replaceEach as follows:

  private String testStringUtils(
    final String text, final Map<String, String> definitions ) {
    final String[] keys = keys( definitions );
    final String[] values = values( definitions );

    return StringUtils.replaceEach( text, keys, values );
  }

This slows down on large texts.

Fast Code

Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity that becomes an implementation detail by using a façade with the same method signature:

  private String testBorAhoCorasick(
    final String text, final Map<String, String> definitions ) {
    // Create a buffer sufficiently large that re-allocations are minimized.
    final StringBuilder sb = new StringBuilder( text.length() << 1 );

    final TrieBuilder builder = Trie.builder();
    builder.onlyWholeWords();
    builder.removeOverlaps();

    final String[] keys = keys( definitions );

    for( final String key : keys ) {
      builder.addKeyword( key );
    }

    final Trie trie = builder.build();
    final Collection<Emit> emits = trie.parseText( text );

    int prevIndex = 0;

    for( final Emit emit : emits ) {
      final int matchIndex = emit.getStart();

      sb.append( text.substring( prevIndex, matchIndex ) );
      sb.append( definitions.get( emit.getKeyword() ) );
      prevIndex = emit.getEnd() + 1;
    }

    // Add the remainder of the string (contains no more matches).
    sb.append( text.substring( prevIndex ) );

    return sb.toString();
  }

Benchmarks

For the benchmarks, the buffer was created using randomNumeric as follows:

  private final static int TEXT_SIZE = 1000;
  private final static int MATCHES_DIVISOR = 10;

  private final static StringBuilder SOURCE
    = new StringBuilder( randomNumeric( TEXT_SIZE ) );

Where MATCHES_DIVISOR dictates the number of variables to inject:

  private void injectVariables( final Map<String, String> definitions ) {
    for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
      final int r = current().nextInt( 1, SOURCE.length() );
      SOURCE.insert( r, randomKey( definitions ) );
    }
  }

The benchmark code itself (JMH seemed overkill):

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );

1,000,000 : 1,000

A simple micro-benchmark with 1,000,000 characters and 1,000 randomly-placed strings to replace.

  • testStringUtils: 25 seconds, 25533 millis
  • testBorAhoCorasick: 0 seconds, 68 millis

No contest.

10,000 : 1,000

Using 10,000 characters and 1,000 matching strings to replace:

  • testStringUtils: 1 seconds, 1402 millis
  • testBorAhoCorasick: 0 seconds, 37 millis

The divide closes.

1,000 : 10

Using 1,000 characters and 10 matching strings to replace:

  • testStringUtils: 0 seconds, 7 millis
  • testBorAhoCorasick: 0 seconds, 19 millis

For short strings, the overhead of setting up Aho-Corasick eclipses the brute-force approach by StringUtils.replaceEach.

A hybrid approach based on text length is possible, to get the best of both implementations.

Implementations

Consider comparing other implementations for text longer than 1 MB, including:

Papers

Papers and information relating to the algorithm:

Dave Jarvis
  • 30,436
  • 41
  • 178
  • 315
  • 5
    Kudos for updating this question with new valuable information, that's very nice. I think a JMH benchmark is still appropriate, at least for reasonable values like 10,000 : 1,000 and 1,000 : 10 (the JIT can do magic optimizations sometimes). – Tunaki Nov 28 '16 at 18:22
  • remove the builder.onlyWholeWords() and it will work similarly to string replace. – Ondrej Sotolar Sep 16 '19 at 13:35
  • Thank you very much for this excellent answer. This is definitely very helpful! I just wanted to comment that in order to compare the two approaches, and also to give a more meaningful example, one should build the Trie only once in the second approach, and apply it to many different input strings. I think this is the main advantage to having access to the Trie versus StringUtils: you only build it once. Still, thank you very much for this answer. It shares very well the methodology to implement the second approach – Vic Seedoubleyew Mar 05 '20 at 15:42
  • Your benchmarks are biaised, but biaised against the algorithm that you want to showcase! To perform a correct test, you should definitely cache the input parameter. Your Trie should be stored as class member or something like that, and then applied to each string you want to replace it in. Most of the time, people want to apply the same rules to a string, not re-create all the strings all the time. – Olivier Grégoire Apr 28 '22 at 11:32
  • This is a wiki, Olivier, feel free to revise the answer with updated benchmarks. – Dave Jarvis Apr 28 '22 at 15:51
20

This worked for me:

String result = input.replaceAll("string1|string2|string3","replacementString");

Example:

String input = "applemangobananaarefruits";
String result = input.replaceAll("mango|are|ts","-");
System.out.println(result);

Output: apple-banana-frui-

bikram
  • 7,127
  • 2
  • 51
  • 63
7

If you are going to be changing a String many times, then it is usually more efficient to use a StringBuilder (but measure your performance to find out):

String str = "The rain in Spain falls mainly on the plain";
StringBuilder sb = new StringBuilder(str);
// do your replacing in sb - although you'll find this trickier than simply using String
String newStr = sb.toString();

Every time you do a replace on a String, a new String object is created, because Strings are immutable. StringBuilder is mutable, that is, it can be changed as much as you want.

Steve McLeod
  • 51,737
  • 47
  • 128
  • 184
  • 1
    I'm afraid, it doesn't help. Whenever the replacement differs from the original in length, you'd need some shifting, which can be more costly than building the string anew. Or am I missing something? – maaartinus Nov 15 '19 at 21:57
4

StringBuilder will perform replace more efficiently, since its character array buffer can be specified to a required length.StringBuilder is designed for more than appending!

Of course the real question is whether this is an optimisation too far ? The JVM is very good at handling creation of multiple objects and the subsequent garbage collection, and like all optimisation questions, my first question is whether you've measured this and determined that it's a problem.

Brian Agnew
  • 268,207
  • 37
  • 334
  • 440
3

Check this:

String.format(str,STR[])

For instance:

String.format( "Put your %s where your %s is", "money", "mouth" );
Max
  • 183
  • 1
  • 5
  • 18
Ali
  • 269
  • 1
  • 3
  • 11
2

Rythm a java template engine now released with an new feature called String interpolation mode which allows you do something like:

String result = Rythm.render("@name is inviting you", "Diana");

The above case shows you can pass argument to template by position. Rythm also allows you to pass arguments by name:

Map<String, Object> args = new HashMap<String, Object>();
args.put("title", "Mr.");
args.put("name", "John");
String result = Rythm.render("Hello @title @name", args);

Note Rythm is VERY FAST, about 2 to 3 times faster than String.format and velocity, because it compiles the template into java byte code, the runtime performance is very close to concatentation with StringBuilder.

Links:

Gelin Luo
  • 14,035
  • 27
  • 86
  • 139
  • This is very very old capability available with numerous templating languages like velocity, JSP even. Also it doesn't answer the question which doesn't require the search strings to be in any pre-defined format. – Angsuman Chakraborty Aug 10 '16 at 10:31
  • Interesting, the accepted answer provides an example: `"%cat% really needs some %beverage%."; `, isn't that `%` separated token a pre-defined format? Your first point is even more funny, JDK provides a lots of "old capabilities", some of them starts from 90's, why people bother using them? Your comments and downvoting does not make any real sense – Gelin Luo Aug 11 '16 at 08:06
  • What is the point of introducing Rythm template engine when there are already many pre-existing template engines, and widely used like Velocity or Freemarker to boot? Also why introduce another product when core Java functionalities more than suffice. I really doubt your statement on performance because Pattern can also be compiled. Would love to see some real numbers. – Angsuman Chakraborty Aug 12 '16 at 05:32
  • Green, You are missing the point. The questioner wants to replace arbitrary strings whereas your solution will replace only strings in predefined format like @ preceded. Yes, the example uses % but only as an example, not as a limiting factor. So you answer does not answer the question and hence the negative point. – Angsuman Chakraborty Aug 12 '16 at 05:34
2

The below is based on Todd Owen's answer. That solution has the problem that if the replacements contain characters that have special meaning in regular expressions, you can get unexpected results. I also wanted to be able to optionally do a case-insensitive search. Here is what I came up with:

/**
 * Performs simultaneous search/replace of multiple strings. Case Sensitive!
 */
public String replaceMultiple(String target, Map<String, String> replacements) {
  return replaceMultiple(target, replacements, true);
}

/**
 * Performs simultaneous search/replace of multiple strings.
 * 
 * @param target        string to perform replacements on.
 * @param replacements  map where key represents value to search for, and value represents replacem
 * @param caseSensitive whether or not the search is case-sensitive.
 * @return replaced string
 */
public String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
  if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
    return target;

  //if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
  if(!caseSensitive) {
    Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
    for(String key : replacements.keySet())
      altReplacements.put(key.toLowerCase(), replacements.get(key));

    replacements = altReplacements;
  }

  StringBuilder patternString = new StringBuilder();
  if(!caseSensitive)
    patternString.append("(?i)");

  patternString.append('(');
  boolean first = true;
  for(String key : replacements.keySet()) {
    if(first)
      first = false;
    else
      patternString.append('|');

    patternString.append(Pattern.quote(key));
  }
  patternString.append(')');

  Pattern pattern = Pattern.compile(patternString.toString());
  Matcher matcher = pattern.matcher(target);

  StringBuffer res = new StringBuffer();
  while(matcher.find()) {
    String match = matcher.group(1);
    if(!caseSensitive)
      match = match.toLowerCase();
    matcher.appendReplacement(res, replacements.get(match));
  }
  matcher.appendTail(res);

  return res.toString();
}

Here are my unit test cases:

@Test
public void replaceMultipleTest() {
  assertNull(ExtStringUtils.replaceMultiple(null, null));
  assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
  assertEquals("", ExtStringUtils.replaceMultiple("", null));
  assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));

  assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));

  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));

  assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\\temp\\*.*", makeMap(".", " dot ", ":", " colon ", "\\", " backslash ", "*", " star "), false));
}

private Map<String, String> makeMap(String ... vals) {
  Map<String, String> map = new HashMap<String, String>(vals.length / 2);
  for(int i = 1; i < vals.length; i+= 2)
    map.put(vals[i-1], vals[i]);
  return map;
}
Community
  • 1
  • 1
Kip
  • 107,154
  • 87
  • 232
  • 265
1

How about using the replaceAll() method?

Avi
  • 19,934
  • 4
  • 57
  • 70
  • 4
    Many different substrings can be handled in a regex (/substring1|substring2|.../). It all depends on what kind of replacement the OP is trying to do. – Avi Aug 25 '09 at 09:50
  • 5
    The OP is looking for something more efficient than `str.replaceAll(search1, replace1).replaceAll(search2, replace2).replaceAll(search3, replace3).replaceAll(search4, replace4)` – Kip Oct 05 '16 at 14:08
1
public String replace(String input, Map<String, String> pairs) {
  // Reverse lexic-order of keys is good enough for most cases,
  // as it puts longer words before their prefixes ("tool" before "too").
  // However, there are corner cases, which this algorithm doesn't handle
  // no matter what order of keys you choose, eg. it fails to match "edit"
  // before "bed" in "..bedit.." because "bed" appears first in the input,
  // but "edit" may be the desired longer match. Depends which you prefer.
  final Map<String, String> sorted = 
      new TreeMap<String, String>(Collections.reverseOrder());
  sorted.putAll(pairs);
  final String[] keys = sorted.keySet().toArray(new String[sorted.size()]);
  final String[] vals = sorted.values().toArray(new String[sorted.size()]);
  final int lo = 0, hi = input.length();
  final StringBuilder result = new StringBuilder();
  int s = lo;
  for (int i = s; i < hi; i++) {
    for (int p = 0; p < keys.length; p++) {
      if (input.regionMatches(i, keys[p], 0, keys[p].length())) {
        /* TODO: check for "edit", if this is "bed" in "..bedit.." case,
         * i.e. look ahead for all prioritized/longer keys starting within
         * the current match region; iff found, then ignore match ("bed")
         * and continue search (find "edit" later), else handle match. */
        // if (better-match-overlaps-right-ahead)
        //   continue;
        result.append(input, s, i).append(vals[p]);
        i += keys[p].length();
        s = i--;
      }
    }
  }
  if (s == lo) // no matches? no changes!
    return input;
  return result.append(input, s, hi).toString();
}
Robin479
  • 1,606
  • 15
  • 16
0

Summary: Single class implementation of Dave's answer, to automatically choose the most efficient of the two algorithms.

This is a full, single class implementation based on the above excellent answer from Dave Jarvis. The class automatically chooses between the two different supplied algorithms, for maximum efficiency. (This answer is for people who would just like to quickly copy and paste.)

ReplaceStrings class:

package somepackage

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import org.apache.commons.lang3.StringUtils;

/**
 * ReplaceStrings, This class is used to replace multiple strings in a section of text, with high
 * time efficiency. The chosen algorithms were adapted from: https://stackoverflow.com/a/40836618
 */
public final class ReplaceStrings {

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search and replace definitions. For maximum efficiency, this will automatically choose
     * between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is long, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(
        final String sourceText, final Map<String, String> searchReplaceDefinitions) {
        final boolean useLongAlgorithm
            = (sourceText.length() > 1000 || searchReplaceDefinitions.size() > 25);
        if (useLongAlgorithm) {
            // No parameter adaptations are needed for the long algorithm.
            return replaceUsing_AhoCorasickAlgorithm(sourceText, searchReplaceDefinitions);
        } else {
            // Create search and replace arrays, which are needed by the short algorithm.
            final ArrayList<String> searchList = new ArrayList<>();
            final ArrayList<String> replaceList = new ArrayList<>();
            final Set<Map.Entry<String, String>> allEntries = searchReplaceDefinitions.entrySet();
            for (Map.Entry<String, String> entry : allEntries) {
                searchList.add(entry.getKey());
                replaceList.add(entry.getValue());
            }
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replaceList);
        }
    }

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search strings and replacement strings. For maximum efficiency, this will automatically
     * choose between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is short, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        if (searchList.size() != replacementList.size()) {
            throw new RuntimeException("ReplaceStrings.replace(), "
                + "The search list and the replacement list must be the same size.");
        }
        final boolean useLongAlgorithm = (sourceText.length() > 1000 || searchList.size() > 25);
        if (useLongAlgorithm) {
            // Create a definitions map, which is needed by the long algorithm.
            HashMap<String, String> definitions = new HashMap<>();
            final int searchListLength = searchList.size();
            for (int index = 0; index < searchListLength; ++index) {
                definitions.put(searchList.get(index), replacementList.get(index));
            }
            return replaceUsing_AhoCorasickAlgorithm(sourceText, definitions);
        } else {
            // No parameter adaptations are needed for the short algorithm.
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replacementList);
        }
    }

    /**
     * replaceUsing_StringUtilsAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText under 1000 characters, and less than 25 search strings.
     */
    private static String replaceUsing_StringUtilsAlgorithm(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        final String[] searchArray = searchList.toArray(new String[]{});
        final String[] replacementArray = replacementList.toArray(new String[]{});
        return StringUtils.replaceEach(sourceText, searchArray, replacementArray);
    }

    /**
     * replaceUsing_AhoCorasickAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText over 1000 characters, or large lists of search strings.
     */
    private static String replaceUsing_AhoCorasickAlgorithm(final String sourceText,
        final Map<String, String> searchReplaceDefinitions) {
        // Create a buffer sufficiently large that re-allocations are minimized.
        final StringBuilder sb = new StringBuilder(sourceText.length() << 1);
        final TrieBuilder builder = Trie.builder();
        builder.onlyWholeWords();
        builder.ignoreOverlaps();
        for (final String key : searchReplaceDefinitions.keySet()) {
            builder.addKeyword(key);
        }
        final Trie trie = builder.build();
        final Collection<Emit> emits = trie.parseText(sourceText);
        int prevIndex = 0;
        for (final Emit emit : emits) {
            final int matchIndex = emit.getStart();

            sb.append(sourceText.substring(prevIndex, matchIndex));
            sb.append(searchReplaceDefinitions.get(emit.getKeyword()));
            prevIndex = emit.getEnd() + 1;
        }
        // Add the remainder of the string (contains no more matches).
        sb.append(sourceText.substring(prevIndex));
        return sb.toString();
    }

    /**
     * main, This contains some test and example code.
     */
    public static void main(String[] args) {
        String shortSource = "The quick brown fox jumped over something. ";
        StringBuilder longSourceBuilder = new StringBuilder();
        for (int i = 0; i < 50; ++i) {
            longSourceBuilder.append(shortSource);
        }
        String longSource = longSourceBuilder.toString();
        HashMap<String, String> searchReplaceMap = new HashMap<>();
        ArrayList<String> searchList = new ArrayList<>();
        ArrayList<String> replaceList = new ArrayList<>();
        searchReplaceMap.put("fox", "grasshopper");
        searchReplaceMap.put("something", "the mountain");
        searchList.add("fox");
        replaceList.add("grasshopper");
        searchList.add("something");
        replaceList.add("the mountain");
        String shortResultUsingArrays = replace(shortSource, searchList, replaceList);
        String shortResultUsingMap = replace(shortSource, searchReplaceMap);
        String longResultUsingArrays = replace(longSource, searchList, replaceList);
        String longResultUsingMap = replace(longSource, searchReplaceMap);
        System.out.println(shortResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(shortResultUsingMap);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingMap);
        System.out.println("----------------------------------------------");
    }
}

Needed Maven dependencies:

(Add these to your pom file if needed.)

    <!-- Apache Commons utilities. Super commonly used utilities.
    https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.10</version>
    </dependency>

    <!-- ahocorasick, An algorithm used for efficient searching and 
    replacing of multiple strings.
    https://mvnrepository.com/artifact/org.ahocorasick/ahocorasick -->
    <dependency>
        <groupId>org.ahocorasick</groupId>
        <artifactId>ahocorasick</artifactId>
        <version>0.4.0</version>
    </dependency>
BlakeTNC
  • 941
  • 11
  • 14