29

I want to replace some strings in a String input :

string=string.replace("<h1>","<big><big><big><b>");
string=string.replace("</h1>","</b></big></big></big>");
string=string.replace("<h2>","<big><big>");
string=string.replace("</h2>","</big></big>");
string=string.replace("<h3>","<big>");
string=string.replace("</h3>","</big>");
string=string.replace("<h4>","<b>");
string=string.replace("</h4>","</b>");
string=string.replace("<h5>","<small><b>");
string=string.replace("</h5>","</b><small>");
string=string.replace("<h6>","<small>");
string=string.replace("</h6>","</small>");

As you can see this approach is not the best, because each time I have to search for the portion to replace etc, and Strings are immutable... Also the input is large, which means that some performance issues are to be considered.

Is there any better approach to reduce the complexity of this code ?

Raedwald
  • 46,613
  • 43
  • 151
  • 237
TheByeByeMan
  • 1,396
  • 2
  • 28
  • 50
  • 5
    `StringBuilder` would be the way to go. It doesn't create temporary Strings and has a [replace](http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html#replace(int,%20int,%20java.lang.String)) method – TheLostMind Nov 04 '14 at 12:35
  • 1
    Take a look at `StringBuilder`: http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html – jntme Nov 04 '14 at 12:36
  • 2
    `StringBuilder.replace(int,int,String)` can't be used directly in the same way as `String.replace(CharSequence,CharSequence)`. You have to use `StringBuilder.indexOf(String)` in a loop, which has a cost. – Deleplace Nov 04 '14 at 12:44
  • 1
    You should also look at using a regular expression matcher that finds all the tags in one pass. – Gene Nov 04 '14 at 12:50
  • 1
    @Gene - Wouldn't a regex with `replaceAll()` or `Pattern / Matcher` be less efficient? – TheLostMind Nov 04 '14 at 12:56
  • 1
    @TheLostMind It will be slightly slower than the `replaceAll`. (Make sure to compile the regex only once.) But it will save 10 passes over big data, which avoids cache misses. You'd have to test it, but I'm sure the regex will come out ahead, probably far ahead. See @Thomas's solution. It's exactly what I'm talking about. – Gene Nov 04 '14 at 13:01
  • 1
    I don't know the use-case of this code, but I would have some serious doubts that you wanna be doing this at all. The `` tags have a semantic meaning where `` etc. only have a visual meaning. And they have been around for ages so I see no compatibility issues like with ` – asontu Nov 04 '14 at 14:55
  • @funkwurm, Thanks for your remark, actually I want to convert some XHtml to Docx and the tags are causing me some troubles (Assuming that I cannot change the behavior of the converter, I have to adapt the Inputs to match my needs ) – TheByeByeMan Nov 04 '14 at 15:23
  • From what you are doing, it appears XSL may be better suited than plain replace. – njzk2 Nov 04 '14 at 16:36
  • possible duplicate of [Java Map replaceAll with multiple String matches](http://stackoverflow.com/questions/20341356/java-map-replaceall-with-multiple-string-matches) – MikeFHay Nov 04 '14 at 16:37
  • 1
    _Is there any better approach_? Yes, use an XML DOM manipulation library. – Salman A Nov 05 '14 at 06:41
  • @funkwurm, also the WYSIWYG editor used won't allow creating this kind of thing `

    – TheByeByeMan Nov 05 '14 at 12:59

7 Answers7

38

Although StringBuilder.replace() is a huge improvement compared to String.replace(), it is still very far from being optimal.

The problem with StringBuilder.replace() is that if the replacement has different length than the replaceable part (applies to our case), a bigger internal char array might have to be allocated, and the content has to be copied, and then the replace will occur (which also involves copying).

Imagine this: You have a text with 10.000 characters. If you want to replace the "XY" substring found at position 1 (2nd character) to "ABC", the implementation has to reallocate a char buffer which is at least larger by 1, has to copy the old content to the new array, and it has to copy 9.997 characters (starting at position 3) to the right by 1 to fit "ABC" into the place of "XY", and finally characters of "ABC" are copied to the starter position 1. This has to be done for every replace! This is slow.

Faster Solution: Building Output On-The-Fly

We can build the output on-the-fly: parts that don't contain replaceable texts can simply be appended to the output, and if we find a replaceable fragment, we append the replacement instead of it. Theoretically it's enough to loop over the input only once to generate the output. Sounds simple, and it's not that hard to implement it.

Implementation:

We will use a Map preloaded with mappings of the replaceable-replacement strings:

Map<String, String> map = new HashMap<>();
map.put("<h1>", "<big><big><big><b>");
map.put("</h1>", "</b></big></big></big>");
map.put("<h2>", "<big><big>");
map.put("</h2>", "</big></big>");
map.put("<h3>", "<big>");
map.put("</h3>", "</big>");
map.put("<h4>", "<b>");
map.put("</h4>", "</b>");
map.put("<h5>", "<small><b>");
map.put("</h5>", "</b></small>");
map.put("<h6>", "<small>");
map.put("</h6>", "</small>");

And using this, here is the replacer code: (more explanation after the code)

public static String replaceTags(String src, Map<String, String> map) {
    StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);

    for (int pos = 0;;) {
        int ltIdx = src.indexOf('<', pos);
        if (ltIdx < 0) {
            // No more '<', we're done:
            sb.append(src, pos, src.length());
            return sb.toString();
        }

        sb.append(src, pos, ltIdx); // Copy chars before '<'
        // Check if our hit is replaceable:
        boolean mismatch = true;
        for (Entry<String, String> e : map.entrySet()) {
            String key = e.getKey();
            if (src.regionMatches(ltIdx, key, 0, key.length())) {
                // Match, append the replacement:
                sb.append(e.getValue());
                pos = ltIdx + key.length();
                mismatch = false;
                break;
            }
        }
        if (mismatch) {
            sb.append('<');
            pos = ltIdx + 1;
        }
    }
}

Testing it:

String in = "Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End";
System.out.println(in);
System.out.println(replaceTags(in, map));

Output: (wrapped to avoid scroll bar)

Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End

Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End

This solution is faster than using regular expressions as that involves much overhead, like compiling a Pattern, creating a Matcher etc. and regexp is also much more general. It also creates many temporary objects under the hood which are thrown away after the replace. Here I only use a StringBuilder (plus char array under its hood) and the code iterates over the input String only once. Also this solution is much faster that using StringBuilder.replace() as detailed at the top of this answer.

Notes and Explanation

I initialized the StringBuilder in the replaceTags() method like this:

StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);

So basically I created it with an initial capacity of 150% of the length of the original String. This is because our replacements are longer than the replaceable texts, so if replacing occurs, the output will obviously be longer than the input. Giving a larger initial capacity to StringBuilder will result in no internal char[] reallocation at all (of course the required initial capacity depends on the replaceable-replacement pairs and their frequency/occurrence in the input, but this +50% is a good upper estimation).

I also utilized the fact that all replaceable strings start with a '<' character, so finding the next potential replaceable position becomes blazing-fast:

int ltIdx = src.indexOf('<', pos);

It's just a simple loop and char comparisons inside String, and since it always starts searching from pos (and not from the start of the input), overall the code iterates over the input String only once.

And finally to tell if a replaceable String does occur at the potential position, we use the String.regionMatches() method to check the replaceable stings which is also blazing-fast as all it does is just compares char values in a loop and returns at the very first mismatching character.

And a PLUS:

The question doesn't mention it, but our input is an HTML document. HTML tags are case-insensitive which means the input might contain <H1> instead of <h1>.
To this algorithm this is not a problem. The regionMatches() in the String class has an overload which supports case-insensitive comparison:

boolean regionMatches(boolean ignoreCase, int toffset, String other,
                          int ooffset, int len);

So if we want to modify our algorithm to also find and replace input tags which are the same but are written using different letter case, all we have to modify is this one line:

if (src.regionMatches(true, ltIdx, key, 0, key.length())) {

Using this modified code, replaceable tags become case-insensitive:

Yo<H1>TITLE</H1><h3>Hi!</h3>Nice day.<H6>Hi back!</H6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End
icza
  • 389,944
  • 63
  • 907
  • 827
  • +1, that is variation of what `appendReplacement`/`appendTail` from `Matcher` does. – Pshemo Nov 04 '14 at 14:08
  • 3
    "This one is much faster that using regular expressions" well, I wouldn't say *much faster*, it depends on how you use regex :) – Pshemo Nov 04 '14 at 14:09
  • @Pshemo Probably you're right, but using regexp usually invovles much overhead, like compiling a `Pattern`, creating a `Matcher` etc. It also creates many temporary objects udner the hood which are thrown away after the use. Here I only use a `StringBuilder` (plus `char` array under its hood). – icza Nov 04 '14 at 14:13
  • I agree, I wasn't saying that it will not be faster than regex, but *much* seemed to me to be exaggeration. – Pshemo Nov 04 '14 at 14:17
  • @Pshemo _Much_ still applies to the `StringBuilder.replace()` solution. I correct it. Thanks – icza Nov 04 '14 at 14:23
  • 2
    That is true, I only wanted to defend a little regex solutions, iterating with `replace(.., ..)` will definitely be much slower than your solution even just because each time it needs to start from beginning. – Pshemo Nov 04 '14 at 14:28
  • 2
    This is the approach I would use. Going through the entire string many times is inefficient. This approach also takes advantage of the fact that all the replacement strings start with '<' to simplify and speed-up the process. Given the nature of this example, we'd expect more than 90% of the text to be copied verbatim and this approach only does so once. – user3294068 Nov 04 '14 at 14:33
13

For performance - use StringBuilder. For convenience you can use Map to store values and replacements.

Map<String, String> map = new HashMap<>();
map.put("<h1>","<big><big><big><b>");
map.put("</h1>","</b></big></big></big>");
map.put("<h2>","<big><big>");
...
StringBuilder builder = new StringBuilder(yourString);
for (String key : map.keySet()) {
    replaceAll(builder, key, map.get(key));
}

... To replace all occurences in StringBuilder you can check here: Replace all occurrences of a String using StringBuilder?

public static void replaceAll(StringBuilder builder, String from, String to)
{
    int index = builder.indexOf(from);
    while (index != -1)
    {
        builder.replace(index, index + from.length(), to);
        index += to.length(); // Move to the end of the replacement
        index = builder.indexOf(from, index);
    }
}
Community
  • 1
  • 1
Heisenberg
  • 3,153
  • 3
  • 27
  • 55
8

Unfortunately StringBuilder doesn't provide a replace(string,string) method, so you might want to consider using Pattern and Matcher in conjunction with StringBuffer:

String input = ...;
StringBuffer sb = new StringBuffer();

Pattern p = Pattern.compile("</?(h1|h2|...)>");
Matcher m = p.matcher( input );
while( m.find() )
{
  String match = m.group();
  String replacement = ...; //get replacement for match, e.g. by lookup in a map

  m.appendReplacement( sb, replacement );
}
m.appendTail( sb );

You could do something similar with StringBuilder but in that case you'd have to implement appendReplacement etc. yourself.

As for the expression you could also just try and match any html tag (although that might cause problems since regex and arbitrary html don't fit very well) and when the lookup doesn't have any result you just replace the match with itself.

Thomas
  • 87,414
  • 12
  • 119
  • 157
5

The particular example you provide seems to be HTML or XHTML. Trying to edit HTML or XML using regular expressions is frought with problems. For the kind of editing you seem to be interested in doing you should look at using XSLT. Another possibility is to use SAX, the streaming XML parser, and have your back-end write the edited output on the fly. If the text is actually HTML, you might be better using a tolerant HTML parser, such as JSoup, to build a parsed representation of the document (like the DOM), and manipulate that before outputting it.

Raedwald
  • 46,613
  • 43
  • 151
  • 237
4

StringBuilder is backed by a char array. So, unlike String instances, it is mutable. Thus, you can call indexOf() and replace() on the StringBuilder.

EpicPandaForce
  • 79,669
  • 27
  • 256
  • 428
TheLostMind
  • 35,966
  • 12
  • 68
  • 104
3

I would do something like this

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        if (tagEquals(str, i, "h1")) {
            sb.append("<big><big><big><b>");
            i += 2;
        } else (tagEquals(s, i, "/h1")) { 
            ...
        } else {
            sb.append(str.charAt(i));
        }
    }

tagEquals is a func which checks a tag name

Evgeniy Dorofeev
  • 133,369
  • 30
  • 199
  • 275
3

Use Apache Commons StringUtils.replaceEach.

String[] searches =     new String[]{"<h1>",                "</h1>",                  "<h2>", ...};
String[] replacements = new String[]("<big><big><big><b>",  "</b></big></big></big>", "<big><big>" ...};
string = StringUtils.replaceEach(string, searches, replacements);
MikeFHay
  • 8,562
  • 4
  • 31
  • 52