
I need to convert Strings that contain letters specific to certain languages (like HÄSTDJUR - note the Ä) to Strings without those special letters (in this case HASTDJUR). How can I do it in Java? Thanks for the help!


It is not really about how it sounds. The scenario is as follows: you want to use the application but don't have a Swedish keyboard. So instead of looking at the character map, you type the word by replacing the special letters with the ordinary letters of the Latin alphabet.

Sean Patrick Floyd
grem
  • HASTDJUR? Germans would expect HAESTDJUR. You seem to assume some particular rules, can you state them explicitly? – MSalters Sep 14 '10 at 10:27
  • A few more cases for you to ponder over: IJ => IJ ? Æ => AE ? DŽ => DZ ? ß => ss ? Ʀ => R ? ð => ? Δ => D ? – MSalters Sep 14 '10 at 10:29
  • @MSalters Once you see Haemaelaeinen written somewhere, you don't want to convert ä to ae any more... – Carlos Sep 14 '10 at 10:41
  • Well, it is Swedish so I know what to expect :) – grem Sep 14 '10 at 13:31

2 Answers


I think your question is the same as this one:

Java - getting rid of accents and converting them to regular letters

and hence the answer is also the same:

Solution

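// Requires: import java.text.Normalizer;
// NFD splits each accented letter into a base letter plus combining marks;
// the regex then deletes every non-ASCII character, including those marks.
String convertedString =
       Normalizer
           .normalize(input, Normalizer.Form.NFD)
           .replaceAll("[^\\p{ASCII}]", "");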

Example Code:

final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        .normalize(input, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "")
);

Output:

This is a funky String
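
Applied to the input from the question, the same approach can be wrapped in a small helper. This is only a minimal sketch: the class name `AccentStripper` and the method name `toAscii` are made up for illustration, and the non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

final class AccentStripper {

    // Decompose accented letters into base letter + combining marks (NFD),
    // then delete every character that is not plain ASCII.
    static String toAscii(String input) {
        return Normalizer
            .normalize(input, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // U+00C4 is "A with diaeresis", so this is the word from the question.
        System.out.println(toAscii("H\u00C4STDJUR")); // prints HASTDJUR

        // U+00C6 (the AE letter) has no decomposition, so it is deleted,
        // not converted to "AE".
        System.out.println(toAscii("\u00C6BLE"));     // prints BLE
    }
}
```

Note that any letter without a canonical decomposition is silently dropped by the regex, so the result can be shorter than expected for such inputs.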

Sean Patrick Floyd
  • seanizer - I need to test it but seems to be the solution. – grem Sep 14 '10 at 13:32
  • This does not appear to deal with composite characters very well (Æ, Œ). – Weckar E. Jan 22 '18 at 15:37
  • @WeckarE. for ligatures, an additional step is required, which is outlined here: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/designDoc/UDF/unicode/NormOperations/splitLigatures.html (End of Page) – Sean Patrick Floyd Jan 22 '18 at 21:03
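
One possible shape for such an additional step is to expand the letters that do not decompose before running the normalizer. The sketch below is not the procedure from the linked page; the class name `LigatureAwareStripper` and the small replacement table are assumptions chosen for illustration, and the table is deliberately incomplete.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

final class LigatureAwareStripper {

    // Hypothetical, incomplete table of letters that NFD does not decompose.
    // Which replacement is "right" depends on the language (assumption).
    private static final Map<String, String> EXPANSIONS = new LinkedHashMap<>();
    static {
        EXPANSIONS.put("\u00C6", "AE"); // capital AE
        EXPANSIONS.put("\u00E6", "ae"); // small ae
        EXPANSIONS.put("\u0152", "OE"); // capital OE
        EXPANSIONS.put("\u0153", "oe"); // small oe
        EXPANSIONS.put("\u00DF", "ss"); // sharp s
    }

    static String toAscii(String input) {
        // Step 1: expand the letters listed in the table.
        String expanded = input;
        for (Map.Entry<String, String> e : EXPANSIONS.entrySet()) {
            expanded = expanded.replace(e.getKey(), e.getValue());
        }
        // Step 2: decompose accents and drop everything that is still non-ASCII.
        return Normalizer
            .normalize(expanded, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "");
    }
}
```

With this table, `toAscii("\u00C6BLE")` yields `AEBLE` instead of losing the first letter, while accented letters are still handled by the normalizer.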

I'd suggest a mapping from the special characters to the ones you want.

Ä --> A
é --> e
A --> A (exactly the same)
etc...

And then you can just call your mapping over your text (in pseudocode):

for letter in string:
   newString += map(letter)

Effectively, you need to create a set of rules that says which ASCII equivalent each character maps to.
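
In Java, the pseudocode above could look roughly like the following. It is a sketch under stated assumptions: the class name `CharTableConverter` and the handful of table entries are illustrative only, and a real table would need an entry for every special letter you expect.

```java
import java.util.HashMap;
import java.util.Map;

final class CharTableConverter {

    // Illustrative, incomplete rule table (assumption): special letter -> ASCII letter.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00C4', 'A'); // A with diaeresis
        TABLE.put('\u00C5', 'A'); // A with ring above
        TABLE.put('\u00D6', 'O'); // O with diaeresis
        TABLE.put('\u00E9', 'e'); // e with acute
    }

    static String convert(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Characters without a rule are copied through unchanged.
            char mapped = TABLE.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }
}
```

For example, `CharTableConverter.convert("H\u00C4STDJUR")` returns `HASTDJUR`. The lookup only matches precomposed characters; if the input stores Ä as `A` followed by a combining diaeresis, it would have to be normalized to NFC first.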

Noel M
  • I am unfortunate and don't know whether `Ä` sounds like `A` or something else. :) – Adeel Ansari Sep 14 '10 at 10:29
  • Who said anything about sounds like? This question seems to be just about removing the decorations on the letters, to put it crudely. – Noel M Sep 14 '10 at 10:30
  • May be not. I couldn't infer that from the question. Are you going on example provided? See the comments on the question, to know what I mean. – Adeel Ansari Sep 14 '10 at 10:33
  • How would you create such a table, and how would you effectively use it? – MSalters Sep 14 '10 at 10:34
  • @MSalters: That's another question. Can be done with some predefined rules, I suppose. – Adeel Ansari Sep 14 '10 at 10:36
  • @MSalters This is just one way. There are probably much better ways (1) create Map<Character,Character> table=new HashMap<Character,Character>(); table.put('Ä','A');.... (2) use Character unicode ; ... Character ascii=table.get(unicode) ; – emory Sep 14 '10 at 10:39
  • It is not really about how it sounds. The scenario is following - you want to use the application, but don't have the Swedish keyboard. So instead of looking at the character map, you type it by replacing special letters with the typical letters from the latin alphabet. – grem Sep 14 '10 at 13:33