
I would like to replace a certain set of characters in a string with their corresponding replacement characters in an efficient way.

For example:

String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

String result = replaceChars("Gračišće", sourceCharacters , targetCharacters );

assert "Gracisce".equals(result);

Is there a more efficient way than using the replaceAll method of the String class?

My first idea was:

final String s = "Gračišće";
String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

// preparation
final char[] sourceString = s.toCharArray();
final char result[] = new char[sourceString.length];
final char[] targetCharactersArray = targetCharacters.toCharArray();

// main work
for(int i=0,l=sourceString.length;i<l;++i)
{
  final int pos = sourceCharacters.indexOf(sourceString[i]);
  result[i] = pos!=-1 ? targetCharactersArray[pos] : sourceString[i];
}

// result
String resultString = new String(result);

Any ideas?

Btw, the UTF-8 characters are causing the trouble; with US_ASCII it works fine.

John Topley
ManBugra

2 Answers


You can make use of java.text.Normalizer and a shot of regex to get rid of the diacritics, of which there exist many more than you have collected so far.

Here's an SSCCE, copy'n'paste'n'run it on Java 6:

package com.stackoverflow.q2653739;

import java.text.Normalizer;
import java.text.Normalizer.Form;

public class Test {

    public static void main(String... args) {
        System.out.println(removeDiacriticalMarks("Gračišće"));
    }

    public static String removeDiacriticalMarks(String string) {
        return Normalizer.normalize(string, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}

This should yield

Gracisce

At least, it does here in Eclipse with the console character encoding set to UTF-8 (Window > Preferences > General > Workspace > Text File Encoding). Ensure that the same is set in your environment as well.
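
If the output still looks garbled, it can help to separate "the string is wrong" from "the console decodes it wrongly". Below is a minimal sketch of one way to check that. The class name `EncodingCheck` and the helper `toEscapes` are made up for illustration and are not part of the answer above. It prints the result as ASCII-only hex escapes, which no console encoding can mangle, and then once more through a `PrintStream` that explicitly encodes as UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

// Hypothetical helper class, for illustration only
class EncodingCheck {

    // Renders every char as a backslash-u hex escape, so the output is pure ASCII
    public static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    // Same approach as in the answer: decompose, then strip the combining marks
    public static String removeDiacriticalMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "Gračišće" written with escapes, so the source file encoding does not matter
        String input = "Gra\u010Di\u0161\u0107e";
        String result = removeDiacriticalMarks(input);

        // ASCII-only view of input and result; safe on any console
        System.out.println(toEscapes(input));
        System.out.println(toEscapes(result));

        // Explicit UTF-8 output; only readable if the console also expects UTF-8
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(input + " -> " + result);
    }
}
```

If the escaped lines look right but the last line does not, the string handling is fine and only the console (or source file) encoding needs fixing.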

As an alternative, maintain a Map<Character, Character>:

Map<Character, Character> charReplacementMap = new HashMap<Character, Character>();
charReplacementMap.put('š', 's');
charReplacementMap.put('đ', 'd');
// Put more here.

String originalString = "Gračišće";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString();
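
For the arbitrary mapping asked about in the question, the map idea can be wrapped into the `replaceChars(input, sourceCharacters, targetCharacters)` shape used there. This is only a sketch under a few assumptions: the class name `CharReplacer` and the helper `buildMap` are invented for illustration, the two mapping strings must have the same length, and the replacement works per `char`, so it expects precomposed characters in the input and does not handle characters outside the Basic Multilingual Plane.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only
class CharReplacer {

    // Pairs up the i-th source char with the i-th target char
    public static Map<Character, Character> buildMap(String sourceChars, String targetChars) {
        if (sourceChars.length() != targetChars.length()) {
            throw new IllegalArgumentException("source and target must have the same length");
        }
        Map<Character, Character> map = new HashMap<Character, Character>();
        for (int i = 0; i < sourceChars.length(); i++) {
            map.put(sourceChars.charAt(i), targetChars.charAt(i));
        }
        return map;
    }

    // Single pass over the input; chars without a mapping are copied unchanged
    public static String replaceChars(String input, String sourceChars, String targetChars) {
        Map<Character, Character> map = buildMap(sourceChars, targetChars);
        StringBuilder builder = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char current = input.charAt(i);
            Character replacement = map.get(current);
            builder.append(replacement != null ? replacement.charValue() : current);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String sourceCharacters = "šđćčŠĐĆČžŽ";
        String targetCharacters = "sdccSDCCzZ";
        System.out.println(replaceChars("Gračišće", sourceCharacters, targetCharacters));
    }
}
```

If the same mapping is applied to many strings, building the map once and reusing it avoids repeating the setup work on every call.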
BalusC
  • with this solution i get: GraA?iA¡Ae. and btw, i'd like to replace not only diacritic characters but some others of other languages too. so i really would like to know a solution that works for an arbitrary mapping. – ManBugra Apr 16 '10 at 14:44
  • Exactly. The problem is that the diacritics are sometimes combined, sometimes not, and string character-by-character replace gets confused because there are actually two characters, not one. – Mr. Shiny and New 安宇 Apr 16 '10 at 14:46
  • @Mr. Shiny and New: yes, System.out.println("š".toCharArray().length); outputs '2' – ManBugra Apr 16 '10 at 14:49
  • @Mr. Shiny and @ManBugra: The `.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");` should take care of removing the combining diacritical marks. Maybe you removed this line? Or you're running an ancient Java version? The above has worked fine for years here and it works for an arbitrary mapping, except for certain Polish characters such as an l with a stroke through it, since that's not a diacritic. – BalusC Apr 16 '10 at 14:51
  • @BalusC: java1.6 on Vista using IntelliJ IDEA, and sorry, i just cant get it working. can you please edit your post and add the imports? – ManBugra Apr 16 '10 at 14:55
  • Done. It's by the way the IDE console which needs to be set to UTF-8. I tried to reproduce here with the console set to ISO-8859-1 and I got the same as you. – BalusC Apr 16 '10 at 15:01
  • @BalusC: yes, console settings was f*d up. it works now. but still, i need a function for an arbitrary character mapping. – ManBugra Apr 16 '10 at 15:08
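
The "one character or two" point from the comments above can be checked without involving the console at all. Here is a small sketch, with the class name `NormalizationDemo` and its helpers invented for illustration. It uses Unicode escapes so the source file encoding plays no role, and it also shows that letters with a stroke, such as đ from the question or the Polish ł, have no canonical decomposition, so the Normalizer approach leaves them untouched.

```java
import java.text.Normalizer;

// Hypothetical demo class, for illustration only
class NormalizationDemo {

    // Canonical decomposition: base letter followed by combining marks
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Decompose, then drop everything in the Combining Diacritical Marks block
    public static String stripMarks(String s) {
        return nfd(s).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String sCaron = "\u0161";  // precomposed s with caron
        String dStroke = "\u0111"; // d with stroke
        String lStroke = "\u0142"; // l with stroke

        // Precomposed form is one char; decomposed form is 's' plus a combining caron
        System.out.println(sCaron.length());
        System.out.println(nfd(sCaron).length());

        // The stroke letters do not decompose, so stripping marks changes nothing
        System.out.println(nfd(dStroke).length());
        System.out.println(stripMarks(dStroke).equals(dStroke));
        System.out.println(stripMarks(lStroke).equals(lStroke));
    }
}
```

So for a mapping that includes đ and Đ, as in the question, an explicit character map is still needed on top of (or instead of) the normalization step.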

I'd use the replace method in a simple loop.

String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";

String s = "Gračišće";
for (int i=0 ; i<sourceCharacters.length() ; i++)
    s = s.replace(sourceCharacters.charAt(i), targetCharacters.charAt(i));

System.out.println(s);
Donal Fellows
  • each iteration would create a new string object. would be nice to do it 'in place' – ManBugra Apr 16 '10 at 14:52
  • Firstly, each iteration only makes a new object if a change is done; if the character being searched for isn't there, the original object is returned. Secondly, it's *far* more annoying to write this code using `StringBuilder` or `StringBuffer` as you have to do all the work yourself; since Java's memory management is tuned for rapid object turnover anyway, it's easier to do it the way I showed instead of trying to figure out how to be efficient. You can always optimize later if really necessary (i.e., if it is a real bottleneck). – Donal Fellows Apr 16 '10 at 15:29
  • yes, you are right on your first point. but i don't agree with your second. you write efficient code once, even if it's annoying, and then reuse it. anyway BalusC solved the riddle. – ManBugra Apr 16 '10 at 15:47
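
The first point in the discussion above can be checked directly: the Javadoc of `String.replace(char, char)` states that a reference to the same `String` is returned when the old character does not occur. A minimal sketch follows; the class name `ReplaceIdentityDemo` and its methods are made up for illustration.

```java
// Hypothetical demo class, for illustration only
class ReplaceIdentityDemo {

    // True when replace() handed back the very same String object
    public static boolean returnsSameInstance(String s, char from, char to) {
        return s.replace(from, to) == s;
    }

    // The loop from the answer, wrapped in a method
    public static String replaceChars(String s, String sourceChars, String targetChars) {
        for (int i = 0; i < sourceChars.length(); i++) {
            s = s.replace(sourceChars.charAt(i), targetChars.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        String s = "Gracisce";

        // 'x' does not occur, so no new String is created
        System.out.println(returnsSameInstance(s, 'x', 'y'));

        // 'c' does occur, so a new String is returned
        System.out.println(returnsSameInstance(s, 'c', 'k'));

        System.out.println(replaceChars("Gračišće", "šđćčŠĐĆČžŽ", "sdccSDCCzZ"));
    }
}
```

In the loop, a new `String` is therefore allocated only for those source characters that actually appear in the input; each call still scans the string once, so the total work grows with the number of mapped characters.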