5

I have searched SO (and Google) but not found any fully matching answer to my question:

I want to replace all Swedish characters and whitespace in a String with other characters. It should work as follows:

  • "å" and "ä" should be replaced with "a"
  • "ö" should be replaced with "o"
  • "Å" and "Ä" should be replaced with "A"
  • "Ö" should be replaced with "O"
  • " " should be replaced with "-"

Can this be achieved with regex (or any other way), and if so, how?

Of course, the below method does the job (and can be improved, I know, by replacing for example "å" and "ä" on the same line):

private String changeSwedishCharactersAndWhitespace(String string) {
    // Each call must operate on newString; calling string.replaceAll(...)
    // every time would discard all but the last replacement.
    String newString = string.replaceAll("å", "a");
    newString = newString.replaceAll("ä", "a");
    newString = newString.replaceAll("ö", "o");
    newString = newString.replaceAll("Å", "A");
    newString = newString.replaceAll("Ä", "A");
    newString = newString.replaceAll("Ö", "O");
    newString = newString.replaceAll(" ", "-");
    return newString;
}

I know how to use regex to replace, for example, all "å", "ä", or "ö" with "". The question is: how do I use regex to replace a character with another one depending on which character it is? Surely there must be a better way than the above approach?

Magnilex
  • Would removing ALL diacritics work for you?... http://stackoverflow.com/questions/1453171/n-n-n-or-remove-diacritical-marks-from-unicode-cha – Zutty Nov 15 '12 at 11:35
  • Perhaps a regex with callback, but not your ordinary search-and-replace. Since Java doesn't have first-class functions, this will get unwieldy. Stick to what you have. – John Dvorak Nov 15 '12 at 11:36
  • @Zutty Thanks, but my real problem is that I don't want them removed, but replaced depending on the character. Otherwise I would have done something similar to your proposal. – Magnilex Nov 15 '12 at 11:43
  • @Zutty Changed my mind, that linked question/answer actually contained the answer to my question. Since English isn't my native language, I didn't really know the word "diacritic", and thought the whole character would be removed. – Magnilex Nov 15 '12 at 12:30

4 Answers

6

For Latin characters with diacritics, Unicode normalization (java.text.Normalizer) might help: it decomposes each character into its base letter code plus a combining diacritic code, and the combining codes can then be stripped. Something like:

import java.text.Normalizer;

// NFKD splits e.g. "å" into "a" + a combining ring; \p{M} then removes all combining marks
String newString = Normalizer.normalize(string,
        Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
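The snippet above leaves whitespace untouched, so a complete version for the question needs one more step. A minimal sketch, assuming it is acceptable that every diacritic is stripped (so "é" also becomes "e", not only the Swedish letters); the class and method names are hypothetical, and the sample text uses Unicode escapes so the source does not depend on file encoding:

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the original answer.
class DiacriticStripper {

    // Decompose, drop combining marks, then map spaces to hyphens.
    static String toAsciiSlug(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");
        return withoutMarks.replace(' ', '-');
    }

    public static void main(String[] args) {
        // Input is "Ake ar pa sjon" with rings and umlauts, written with escapes
        String sample = "\u00C5ke \u00E4r p\u00E5 sj\u00F6n";
        System.out.println(toAsciiSlug(sample)); // prints Ake-ar-pa-sjon
    }
}
```

If only the six Swedish letters should change and other accented characters must be preserved, an explicit mapping is the safer choice.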
Joop Eggen
3

You can use StringUtils.replaceEach from Apache Commons Lang, like this:

private String changeSwedishCharactersAndWhitespace(String string) {
    String newString = StringUtils.replaceEach (string, 
      new String[] {"å", "ä", "ö", "Å", "Ä", "Ö", " "}, 
      new String[] {"a", "a", "o", "A", "A", "O", "-"});
    return newString;
}
ShyJ
3

I don't think there is a single regex that replaces all these characters at once. That said, you can simplify the replacement work by using a HashMap:

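HashMap<String, String> map = new HashMap<String, String>()
                              {{put("ä", "a"); /*put others*/}};

// Start from the input and apply each replacement to the running result;
// calling string.replaceAll(...) each time would keep only the last one.
String newString = string;
for (Map.Entry<String, String> entry : map.entrySet())
    newString = newString.replaceAll(entry.getKey(), entry.getValue());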
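The map can also be combined with a regex so that the string is scanned only once and each match is replaced depending on which character it is. A minimal sketch using Matcher.appendReplacement and Matcher.appendTail; the class, field and method names are hypothetical, and the Swedish letters are written as Unicode escapes so the source does not depend on file encoding:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MappedReplace {

    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("\u00E5", "a"); // a with ring
        REPLACEMENTS.put("\u00E4", "a"); // a with umlaut
        REPLACEMENTS.put("\u00F6", "o"); // o with umlaut
        REPLACEMENTS.put("\u00C5", "A"); // A with ring
        REPLACEMENTS.put("\u00C4", "A"); // A with umlaut
        REPLACEMENTS.put("\u00D6", "O"); // O with umlaut
        REPLACEMENTS.put(" ", "-");
    }

    // Character class matching exactly the keys of REPLACEMENTS
    private static final Pattern TARGETS =
            Pattern.compile("[\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6 ]");

    static String replaceMapped(String input) {
        Matcher m = TARGETS.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the replacement for the matched character;
            // quoteReplacement guards against '$' or '\' in replacement values
            m.appendReplacement(out, Matcher.quoteReplacement(REPLACEMENTS.get(m.group())));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceMapped("Hell\u00F6 W\u00E4rld")); // prints Hello-Warld
    }
}
```

StringBuffer is used because appendReplacement only accepts a StringBuilder from Java 9 onwards.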
Juvanis
0

You can write your own mapper using the Matcher.find method:

public static void main(String[] args) {
    String from = "äöÂ";
    String to   = "aoA";
    String testString = "Hellö Wärld";

    Pattern p = Pattern.compile(String.format("[%s]", from));
    Matcher m = p.matcher(testString);
    String result = testString;
    while (m.find()){
        char charFound = m.group(0).charAt(0);
        result = result.replace(charFound, to.charAt(from.indexOf(charFound)));
    }

    System.out.println(result);
}

This will replace

Hellö Wärld

with

Hello Warld
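The same from/to idea also works without a regex, by walking the string once and looking each character up in the `from` string. A minimal sketch under the assumption that `from` and `to` have equal length; the class and method names are hypothetical, and non-ASCII characters are written as Unicode escapes so the source does not depend on file encoding:

```java
// Hypothetical helper class, not part of the original answer.
class CharTranslator {

    // Replaces every character found in 'from' with the character
    // at the same index in 'to'; all other characters are copied as-is.
    static String translate(String input, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int index = from.indexOf(c);
            out.append(index >= 0 ? to.charAt(index) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Maps a-umlaut, o-umlaut and space to 'a', 'o' and '-'
        String result = translate("Hell\u00F6 W\u00E4rld", "\u00E4\u00F6 ", "ao-");
        System.out.println(result); // prints Hello-Warld
    }
}
```

Including the space in `from` covers the whitespace requirement from the question in the same pass.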
Rodrigo