I'm trying to convert all accented Latin Unicode characters into their plain [a-z] equivalents:

ó --> o
í --> i

I can easily do this one character at a time, for example:

    myString = myString.replaceAll("ó","o");

but since there are tons of variations, this approach is impractical.

Is there another way of doing it in Java, for example a regular expression or a utility library?

USE CASE:

1- City names from other languages converted into English, e.g.

Espírito Santo --> Espirito Santo

nafas
  • http://stackoverflow.com/a/25057742/984823 But still be aware of some exceptions like l-stroke. – Joop Eggen Sep 22 '15 at 13:16
  • This is a very crude approach for your use case. In German, in situations where only ASCII can be displayed, an umlaut is replaced by the base vowel followed by an e, e.g. München becomes Muenchen. And the actual English name of that city is Munich. I'd suggest just leaving the accents. If your application is not able to display those accents, then your application is horribly broken. – roeland Sep 22 '15 at 23:18
  • @roeland Yes, I understand that. The problem is that München is written differently in many different languages. Now imagine trying to analyze all this data at big-data scale. The approach I have in mind might not always give us the right city, but it at least tries to normalize the names "as much as possible" (there is a saying that if the rate is over 80%, it's good enough). This is what we are aiming for. – nafas Sep 23 '15 at 09:42
  • @nafas Ah I understand – roeland Sep 23 '15 at 22:34

1 Answer

This answer requires Java 1.6 or above, which added java.text.Normalizer.

    String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
    String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

Example:

    import java.text.Normalizer;

    public class Main {
        public static void main(String[] args) {
            String input = "Árvíztűrő tükörfúrógép";
            System.out.println("Input: " + input);
            // NFD splits each accented letter into a base letter plus combining marks
            String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
            System.out.println("Normalized: " + normalized);
            // Delete the combining marks, leaving only the base letters
            String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
            System.out.println("Result: " + accentRemoved);
        }
    }

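Output (the `Normalized:` line is omitted here; it renders the same as the input, because the combining marks are still drawn on their base letters):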

Input: Árvíztűrő tükörfúrógép
Result: Arvizturo tukorfurogep
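
Because the decomposed string looks identical to the input when printed, it can help to inspect the code points instead. The following is only a minimal sketch under a few assumptions: the class name `DecompositionDemo` and the helper `codePoints` are made up for illustration, and `String.codePoints()` needs Java 8 or later (the `Normalizer` call itself only needs Java 6).

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of the answer's original code.
class DecompositionDemo {
    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u00ED"; // the single precomposed character í
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        System.out.println(codePoints(input)); // U+00ED
        System.out.println(codePoints(nfd));   // U+0069 U+0301
    }
}
```

U+0069 is the plain ASCII `i` and U+0301 is the combining acute accent, which is the part the `replaceAll` call deletes.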
isapir
EpicPandaForce
  • @JoopEggen I don't know how it works, I just know it works :) – EpicPandaForce Sep 22 '15 at 13:18
  • @EpicPandaForce I'm not trying to replace them with "". For example, I want to replace "í" with "i". – nafas Sep 22 '15 at 13:22
  • The first normalize step decomposes the single char `í` into ASCII `i` plus a zero-width combining `´`. Then all those accents (combining diacritical marks) are deleted, and only the ASCII letters remain. The D in NFD stands for Decomposition. – Joop Eggen Sep 22 '15 at 13:25
  • I had no idea this class was part of core Java. Thanks for enlightening me! – ControlAltDel Sep 22 '15 at 13:26
  • @EpicPandaForce It works great, mate. At first, looking at the code gave me a bad impression, but it's fantastic. – nafas Sep 22 '15 at 13:27
  • @EpicPandaForce Using `InCombiningDiacriticalMarks` is considered a bad approach. Use `\p{Mn}` instead. https://stackoverflow.com/a/5697575/4866465 – daniel gi Mar 14 '21 at 10:48
  • How do you deal with ambiguous characters like Ł and Đ? Simply normalizing won't work. Any ideas for such characters? – Rinku Chowdhury Oct 29 '21 at 09:22
  • I'm not sure, I didn't encounter such letters in our problem – EpicPandaForce Oct 29 '21 at 09:25
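
The last two points raised in the comments (matching `\p{Mn}` rather than one Unicode block, and letters such as Ł and Đ that have no canonical decomposition) could be handled along the following lines. This is a rough sketch, not a definitive solution: the class name `AsciiFolder`, the method `fold`, and the replacement table are all assumptions made for illustration, and the table is deliberately incomplete.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; names and table contents are illustrative assumptions.
class AsciiFolder {
    // Letters that NFD leaves untouched because they have no canonical
    // decomposition. Extend this table to match your own data.
    private static final Map<Character, String> NO_DECOMPOSITION = new HashMap<>();
    static {
        NO_DECOMPOSITION.put('\u0141', "L"); // Ł
        NO_DECOMPOSITION.put('\u0142', "l"); // ł
        NO_DECOMPOSITION.put('\u0110', "D"); // Đ
        NO_DECOMPOSITION.put('\u0111', "d"); // đ
        NO_DECOMPOSITION.put('\u00D8', "O"); // Ø
        NO_DECOMPOSITION.put('\u00F8', "o"); // ø
    }

    static String fold(String input) {
        // Step 1: decompose accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: delete all nonspacing marks (general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: map the remaining special letters through the table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = NO_DECOMPOSITION.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Esp\u00EDrito Santo"));   // Espirito Santo
        System.out.println(fold("\u0141\u00F3d\u017A"));   // Lodz
    }
}
```

Which ASCII letter a stroke letter should map to is a judgment call that depends on the language and the use case, so the table is kept explicit rather than hidden behind a regular expression.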