
We want to rename strings so that "strange" characters such as German umlauts are converted to their official non-umlaut representation. Is there a function in Java that handles this mapping, not only for German umlauts but also for French, Czech, or Scandinavian characters? The goal is a function that renames files and directories so that Subversion can handle them without problems on different platforms.

This question is similar but without a useful answer.

Thomas S.
  • possible duplicate of [How to replace umlauts in a string?](http://stackoverflow.com/questions/20420080/how-to-replace-umlauts-in-a-string) – Werner Kvalem Vesterås Mar 09 '15 at 14:02
  • Or a duplicate of this: http://stackoverflow.com/questions/1234510/how-do-i-replace-a-character-in-a-string-in-java Could be either. – Layna Mar 09 '15 at 14:04
  • The ä does not correspond to ae, the ö does not correspond to the oe, and the ü does not correspond to ue. These are different characters. You may, however, translate them to regular (e.g. ä -> a) by creating a map with the corresponding characters, checking if that character is in that map and if so replacing it. – lacraig2 Mar 09 '15 at 14:04
  • The Scandinavian characters å,ä,ö,ø,æ are actually not umlauts, they are separate characters. So there is no such official translation. – stenix Mar 09 '15 at 14:09
  • The term for this operation is transliteration. – Jere Käpyaho Mar 09 '15 at 14:10
  • Umlaut is a misnomer. In this context it is just the German-originated name for the diacritic above the base character. The letters åäö can be expressed in Unicode as canonical decompositions of the base character and the diacritic. – Jere Käpyaho Mar 09 '15 at 14:21
  • @lacraig2: at least in German it is completely common to replace ä, ö, ü with ae, oe and ue if umlauts are not allowed or possible to enter, e.g. on a US keyboard. – Thomas S. Mar 09 '15 at 14:30
  • @ThomasS. I'm not familiar with umlaut use in German, but from my admittedly limited experience it makes a difference in French. – lacraig2 Mar 09 '15 at 14:31

3 Answers


Use the ICU Transliterator, a generic class for performing this kind of transliteration. You may need to supply your own mapping rules.

Jere Käpyaho

You can use the Unicode block property \p{InCombiningDiacriticalMarks} to remove (most) diacritical marks from Strings:

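import java.text.Normalizer;
import java.util.regex.Pattern;

// Matches the combining marks that canonical decomposition leaves behind.
private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

public String normalize(String input) {
  // NFD splits e.g. "ö" into "o" followed by U+0308 (combining diaeresis).
  String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
  return DIACRITICS.matcher(decomposed).replaceAll("");
}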

This will not replace German umlauts the way you desire, though. It will turn ö into o, ä into a and so on. But maybe that's okay for you, too.
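
If you do want the German convention (ä -> ae and so on), one possible approach is to apply a small replacement map first and only then strip the remaining diacritics. The following is a minimal sketch using only the JDK. The class name GermanTransliterator, the method toAscii and the contents of the map are my own assumptions for illustration, not part of any library, and the source uses \u escapes so it compiles regardless of file encoding:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class GermanTransliterator {
    // Assumed mapping following the common German convention.
    private static final Map<Character, String> GERMAN = new HashMap<>();
    static {
        GERMAN.put('\u00E4', "ae"); // a with diaeresis
        GERMAN.put('\u00F6', "oe"); // o with diaeresis
        GERMAN.put('\u00FC', "ue"); // u with diaeresis
        GERMAN.put('\u00C4', "Ae"); // A with diaeresis
        GERMAN.put('\u00D6', "Oe"); // O with diaeresis
        GERMAN.put('\u00DC', "Ue"); // U with diaeresis
        GERMAN.put('\u00DF', "ss"); // sharp s
        GERMAN.put('\u1E9E', "SS"); // capital sharp s
    }

    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String toAscii(String input) {
        // Compose first, so "a" + U+0308 is seen as the single mapped character.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            String replacement = GERMAN.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        // Decompose what is left and drop the combining marks.
        String decomposed = Normalizer.normalize(sb.toString(), Normalizer.Form.NFD);
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }
}
```

Calling GermanTransliterator.toAscii on the German word for "greetings" (written with u-umlaut and sharp s) should give "Gruesse", while French or Czech accented letters simply lose their accents. Letters without a canonical decomposition, such as the Scandinavian ø and æ, pass through unchanged, so you would have to add them to the map yourself.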

user1438038

The transliterator ID you want is Any-Latin; De-ASCII; Latin-ASCII;

This is a PHP-specific answer using Transliterator (sorry for not providing Java code):

$val = 'BEGIN..Ä..Ö..Ü..ä..ö..ü..ẞ..ß..END';
echo Transliterator::create('Any-Latin; De-ASCII; Latin-ASCII;')->transliterate($val);
// output
//    BEGIN..AE..OE..UE..ae..oe..ue..SS..ss..END

The plain ASCII rule, Any-Latin; Latin-ASCII;, drops the diacritics instead and yields BEGIN..A..O..U..a..o..u..SS..ss..END

These rules should work in any language with support for ICU (International Components for Unicode).

hrvoj3e