1

I have the following Java code that replaces accented characters such as:

á é í ó ú Á É Í Ó Ú à è ì ò ù À È Ì Ò Ù 

text = text.replace( "á", "a" );
    text = text.replace( "é", "e" );
    text = text.replace( "í", "i" );
    text = text.replace( "ó", "o" );
    text = text.replace( "ú", "u" );

    // odd characters: uppercase acute accents
    text = text.replace( "Ã", "A" );
    text = text.replace( "É", "E" );
    text = text.replace( "Ã", "I" );
    text = text.replace( "Ó", "O" );
    text = text.replace( "Ú", "U" );


    // odd characters: lowercase grave accents
    text = text.replace( "à", "a" );
    text = text.replace( "è", "e" );
    text = text.replace( "ì", "i" );
    text = text.replace( "ò", "o" );
    text = text.replace( "ù", "u" );

    // odd characters: uppercase grave accents
    text = text.replace( "À", "A" );
    text = text.replace( "È", "E" );
    text = text.replace( "Ì", "I" );
    text = text.replace( "Ã’", "O" );
    text = text.replace( "Ù", "U" );

    // odd characters: lowercase and uppercase ñ
    text = text.replace( "Ñ", "n" );
    text = text.replace( "ñ", "N" );

I want to use a notation like:

text = text.replace( "\uD1232", "N" );

But I don't know where to find a table with the escape codes for those characters: ... À, È, Ì ...

Sequoya
  • You shouldn't do this manually, use [`Normalizer`](http://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html) instead; that's what it's designed for. – Mick Mnemonic May 16 '17 at 22:32
  • Possible duplicate of [Easy way to remove UTF-8 accents from a string?](http://stackoverflow.com/questions/15190656/easy-way-to-remove-utf-8-accents-from-a-string) – Mick Mnemonic May 16 '17 at 22:32

2 Answers

0

The JDK contains a tool named native2ascii (up to JDK 8; the tool was removed in JDK 9).

Create a text file in UTF-8 encoding with the special characters.

For example, a file in.txt:

á é í ó ú Á É Í Ó Ú à è ì ò ù À È Ì Ò Ù 

Then call:

native2ascii -encoding UTF-8 in.txt out.txt

Afterwards, the file out.txt contains the corresponding escape sequences:

\u00e1 \u00e9 \u00ed \u00f3 \u00fa \u00c1 \u00c9 \u00cd \u00d3 \u00da \u00e0 \u00e8 \u00ec \u00f2 \u00f9 \u00c0 \u00c8 \u00cc \u00d2 \u00d9 
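
If native2ascii is not at hand, the same escapes can be produced from Java itself. The following is only a minimal sketch, not a JDK facility: the class `UnicodeEscapes` and its method `toEscapes` are hypothetical names chosen for illustration. It leaves ASCII characters untouched and writes every other UTF-16 code unit as a four-digit lowercase hex escape. The input is written with escapes so that the result does not depend on the encoding of the source file.

```java
class UnicodeEscapes {

    // Converts every non-ASCII UTF-16 code unit into a backslash-u escape.
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                sb.append(c); // plain ASCII stays as it is
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a with acute, a space, n with tilde
        String input = "\u00e1 \u00f1";
        System.out.println(toEscapes(input));
    }
}
```

The printed escapes can then be pasted into the string literals of the replace calls.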
vanje
0

Part of the text appears to be UTF-8 encoded text that was mistakenly decoded as ISO-8859-1 (Latin-1) or a similar single-byte charset.

The following is a successful attempt to repair it:

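// Imports needed by this snippet.
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public static void main(String[] args) {
    // First argument: the garbled literal; second: the replacement
    // the question paired it with.
    p1("Ã ", "a"); // ordinary space: corrupted entry
    p1("Ã\u00a0", "a"); // Non-breaking space instead
    p1("Ã¨", "e");
    p1("Ã¬", "i");
    p1("Ã²", "o");
    p1("Ã¹", "u");

    // odd characters: uppercase grave accents
    p1("Ã€", "A");
    p1("Ãˆ", "E");
    p1("ÃŒ", "I");
    p1("Ã’", "O");
    p1("Ã™", "U");

    // odd characters: lowercase and uppercase ñ
    p1("Ã‘", "n");
    p1("Ã±", "N");
}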

// First attempt: assume the UTF-8 bytes were misread as ISO-8859-1.
static void p1(String s, String t) {
    // Recover the raw bytes, then decode them as UTF-8.
    String v = new String(s.getBytes(StandardCharsets.ISO_8859_1),
            StandardCharsets.UTF_8);
    // Decompose accented letters and drop the combining marks.
    String u = Normalizer.normalize(v, Normalizer.Form.NFD)
            .replaceAll("\\pM", "");
    if (u.equalsIgnoreCase(t)) {
        System.out.printf("[1] %s -> %s :: %s%n", v, u, t);
    } else {
        p2(s, t);
    }
}

// Fallback: assume the bytes were misread as Windows-1252 instead.
static void p2(String s, String t) {
    String v = new String(s.getBytes(Charset.forName("Windows-1252")),
            StandardCharsets.UTF_8);
    String u = Normalizer.normalize(v, Normalizer.Form.NFD)
            .replaceAll("\\pM", "");
    System.out.printf("[2] %s -> %s :: %s%n", v, u, t);
}

[2] �  -> �  :: a
[1] à -> a :: a
[1] è -> e :: e
[1] ì -> i :: i
[1] ò -> o :: o
[1] ù -> u :: u
[2] À -> A :: A
[2] È -> E :: E
[2] Ì -> I :: I
[2] Ò -> O :: O
[2] Ù -> U :: U
[2] Ñ -> N :: n
[1] ñ -> n :: N

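As you can see, n and N are evidently swapped in the original code, and the first entry, the one with the ordinary space, is evidently corrupted: the space should be a non-breaking space. Applying s = s.replace(' ', '\u00a0'); beforehand would repair it.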

The code above uses Normalizer to throw away the accents: normalizing to form NFD splits each accented letter into a base letter plus combining diacritical marks, and replaceAll then removes the marks.
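
The accent-stripping step on its own, without any charset repair, can be sketched as follows. The class `AccentStripper` and the method `stripAccents` are hypothetical names for illustration; `\\p{M}` is the braced spelling of the `\\pM` pattern used above. The test string is written with escapes so the source file's encoding does not matter.

```java
import java.text.Normalizer;

class AccentStripper {

    // Decomposes accented letters (NFD) and removes the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // a-acute, E-acute, i-grave, O-grave, n-tilde
        String input = "\u00e1\u00c9\u00ec\u00d2\u00f1";
        System.out.println(stripAccents(input)); // prints aEiOn
    }
}
```

Letters that have no canonical decomposition, such as ø, pass through unchanged, so this approach covers accents but is not a general transliteration.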

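  • UTF-8 is a variable-length encoding of Unicode
  • ISO-8859-1 is Latin-1, a single-byte charset whose 256 characters coincide with the first 256 Unicode code points; its bytes match UTF-8 only in the ASCII range
  • Windows-1252 is Windows Latin-1, a "superset" of Latin-1 that places printable characters such as € and curly quotes in the range 0x80-0x9F, where Latin-1 has control characters.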
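
How the three charsets interact can be illustrated with a round trip: encode as UTF-8, decode with the wrong single-byte charset, then reverse the mistake. This is a sketch under the assumption that the damage happened exactly once; `MojibakeDemo`, `garble` and `repair` are hypothetical names for illustration. It also shows why the code above needs the Windows-1252 fallback: the second UTF-8 byte of À is 0x80, which Windows-1252 decodes to €, a character Latin-1 cannot encode back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the damage: UTF-8 bytes decoded with the wrong charset.
    static String garble(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Reverses it: re-encode with the wrong charset, decode as UTF-8.
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lower = "\u00e0"; // a with grave, UTF-8 bytes C3 A0
        String upper = "\u00c0"; // A with grave, UTF-8 bytes C3 80

        // Latin-1 round trip works for the lowercase letter.
        String lowerGarbled = garble(lower, StandardCharsets.ISO_8859_1);
        System.out.println(repair(lowerGarbled, StandardCharsets.ISO_8859_1).equals(lower));

        // The uppercase letter garbled via Windows-1252 needs Windows-1252 to repair.
        String upperGarbled = garble(upper, CP1252);
        System.out.println(repair(upperGarbled, CP1252).equals(upper));
        System.out.println(repair(upperGarbled, StandardCharsets.ISO_8859_1).equals(upper));
    }
}
```

The first two comparisons succeed and the third fails, which mirrors the [1] and [2] branches in the output above.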

(To avoid surprises, the code above is best edited and compiled as a Java source file in UTF-8 encoding, e.g. with javac -encoding UTF-8.)

Joop Eggen