
I am using the following mapping file to build a HashMap whose keys are accented Unicode characters and whose values are the ASCII characters they should map to: https://github.com/lmjabreu/solr-conftemplate/blob/master/mapping-ISOLatin1Accent.txt

So far I have written the following code to remove accents from a string:

import java.util.HashMap;

public class ACCENTS {

    public static void main(String[] args){

        // this is the hashmap that stores the mappings of the characters to their ascii equivalent
        HashMap<Character, Character> characterMappings = new HashMap<>();

        characterMappings.put('\u00C0', 'A');
        characterMappings.put('\u00C1', 'A');
        characterMappings.put('\u00C2', 'A');
        characterMappings.put('\u00C3', 'A');
        characterMappings.put('\u00C4', 'A');
        characterMappings.put('\u00C5', 'A');
        characterMappings.put('\u00C7','C');
        characterMappings.put('\u00C8', 'E');
        characterMappings.put('\u00C9','E');
        characterMappings.put('\u00CA', 'E');
        characterMappings.put('\u00CB', 'E');
        characterMappings.put('\u00CC', 'I');
        characterMappings.put('\u00CD', 'I');
        characterMappings.put('\u00CE', 'I');
        characterMappings.put('\u00CF', 'I');
        characterMappings.put('\u00D0', 'D');
        characterMappings.put('\u00D1', 'N');
        characterMappings.put('\u00D2', 'O');
        characterMappings.put('\u00D3', 'O');
        characterMappings.put('\u00D4', 'O');
        characterMappings.put('\u00D5', 'O');
        characterMappings.put('\u00D6', 'O');
        characterMappings.put('\u00D8', 'O');
        characterMappings.put('\u00D9', 'U');
        characterMappings.put('\u00DA', 'U');
        characterMappings.put('\u00DB', 'U');
        characterMappings.put('\u00DC', 'U');
        characterMappings.put('\u00DD', 'Y');
        characterMappings.put('\u0178', 'Y');
        characterMappings.put('\u00E0', 'a');
        characterMappings.put('\u00E1', 'a');
        characterMappings.put('\u00E2', 'a');
        characterMappings.put('\u00E3','a');
        characterMappings.put('\u00E4', 'a');
        characterMappings.put('\u00E5', 'a');
        characterMappings.put('\u00E7', 'c');
        characterMappings.put('\u00E8', 'e');
        characterMappings.put('\u00E9', 'e');
        characterMappings.put('\u00EA','e');
        characterMappings.put('\u00EB', 'e');
        characterMappings.put('\u00EC', 'i');
        characterMappings.put('\u00ED', 'i');
        characterMappings.put('\u00EE', 'i');
        characterMappings.put('\u00EF', 'i');
        characterMappings.put('\u00F0', 'd');
        characterMappings.put('\u00F1','n' );
        characterMappings.put('\u00F2', 'o');
        characterMappings.put('\u00F3', 'o');
        characterMappings.put('\u00F4', 'o');
        characterMappings.put('\u00F5', 'o');
        characterMappings.put('\u00F6', 'o');
        characterMappings.put('\u00F8', 'o');
        characterMappings.put('\u00F9', 'u');
        characterMappings.put('\u00FA', 'u');
        characterMappings.put('\u00FB', 'u');
        characterMappings.put('\u00FC', 'u');
        characterMappings.put('\u00FD', 'y');
        characterMappings.put('\u00FF', 'y');

        String token = "nа̀ра";
        String newString = "";


        for(int i = 0 ; i < token.length() ; ++i){
            if( characterMappings.containsKey(token.charAt(i)) )
                newString += characterMappings.get(token.charAt(i));
            else
                newString += token.charAt(i);
        }

        System.out.println(newString);
    }
}

The expected result is "napa", but no conversion is performed. What could be causing this? I have not been able to find the reason.

AnkitSablok
  • Have you tried using a string with other special characters, such as "\u00FF\u00FD\u0178", to see if the hashmap itself works as intended? – IllusiveBrian Sep 27 '13 at 16:50
  • Your `characterMappings` map doesn't actually seem to include the character 'p' with an accent. – Louis Wasserman Sep 27 '13 at 16:52
  • The hashmap itself is working but not for the cyrillic characters :( – AnkitSablok Sep 27 '13 at 16:54
  • Have a look at this SO link. Perhaps the issue is when you create the `String token`. http://stackoverflow.com/questions/5729806/encode-string-to-utf-8 – Meesh Sep 27 '13 at 16:56
  • It says that strings have no encoding in Java, only bytes do, so what should I do now? The problem, however, is to map UTF-8 data to ASCII data. – AnkitSablok Sep 27 '13 at 17:00
  • Is this what you are looking for: http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/ ? – Scheintod Sep 27 '13 at 17:29

2 Answers


Not sure why you want to use a HashMap, but if you just want to remove the diacritics, perhaps this helps:

String s = "nа̀ра";
s = Normalizer.normalize( s, Normalizer.Form.NFD );
s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
System.out.println( s );

--> napa

(If you insist on using the HashMap, you should still have a look at the 'Normalizer' class, because it can work in the other direction, too.)
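A minimal sketch of that "other direction", assuming a Latin base letter followed by a combining accent: normalizing to NFC first composes the pair into a single precomposed char, which a `HashMap<Character, Character>` can then match. The class name `ComposeThenMap` and the helper `mapAccents` are hypothetical, and the table holds only one entry for illustration.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class ComposeThenMap {

    // Hypothetical helper: composes the input (NFC), then applies a per-char lookup table.
    static String mapAccents(String input, Map<Character, Character> table) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            Character mapped = table.get(c);
            // Keep the original char when the table has no entry for it
            out.append(mapped != null ? mapped : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> table = new HashMap<>();
        table.put('\u00E0', 'a'); // precomposed "a with grave"

        // Latin 'a' followed by a combining grave accent: two chars before NFC
        String decomposed = "a\u0300";
        System.out.println(decomposed.length());           // 2
        System.out.println(mapAccents(decomposed, table)); // a
    }
}
```

This only helps when Unicode defines a precomposed form for the pair; a base letter without one stays as two chars after NFC and still will not match a single-char key.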

Taken from this article: http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/
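One caveat may be worth checking. The sketch below wraps the two calls from the snippet above in a hypothetical helper `stripMarks` and runs it on two inputs. The second input is an assumed reconstruction of the question's token, built on the guess (suggested by the asker's comment about Cyrillic characters) that its letters are Cyrillic look-alikes. Under that assumption the accent is removed, but the base letters stay Cyrillic, so the result prints like "napa" without being equal to the ASCII string.

```java
import java.text.Normalizer;

public class StripMarks {

    // Hypothetical helper wrapping the NFD + regex approach.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // Precomposed Latin letters: a-grave, e-acute
        String latin = stripMarks("\u00E0\u00E9");
        System.out.println(latin);               // ae

        // Assumed token: 'n', Cyrillic a, combining grave, Cyrillic er, Cyrillic a
        String cyrillic = stripMarks("n\u0430\u0300\u0440\u0430");
        System.out.println(cyrillic.equals("napa"));                // false
        System.out.println(cyrillic.equals("n\u0430\u0440\u0430")); // true
    }
}
```

If ASCII output is required for such input, an explicit table from look-alike letters to Latin ones would be needed in addition to the normalization step.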

Scheintod
  • You're welcome. Unicode is a pain in the ass. (But it's much less than the situation was before, trust me!) – Scheintod Sep 27 '13 at 17:59

You ran into one of the ugliest 'features' of Java: one Unicode character may be represented by a tuple (or even a triple) of chars.

In fact, token has a length of 5 chars. á is a combination of two chars and can only be represented as a String.
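The length claim can be checked by listing the individual chars. This is a small sketch; the helper `describe` is hypothetical, and the escaped token is an assumed reconstruction of the question's string (one Latin letter, three Cyrillic letters and one combining grave accent), not a verified copy of it.

```java
public class InspectChars {

    // Hypothetical helper: lists every UTF-16 char of the string as U+XXXX.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed token: 'n', Cyrillic a, combining grave, Cyrillic er, Cyrillic a
        String token = "n\u0430\u0300\u0440\u0430";
        System.out.println(token.length());  // 5
        System.out.println(describe(token)); // U+006E U+0430 U+0300 U+0440 U+0430
    }
}
```

Running the same helper on the real input shows which keys, if any, a per-char lookup table could ever match.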

This is why

 characterMappings.put('а̀`', 'y'); //(accent can't be displayed correctly in code-mode, try it yourself)

won't compile.

Here is a more detailed explanation.

In my humble opinion, String is one of the worst classes in Java, especially if you use 'non-standard' characters.

To solve your problem I would suggest changing your map to Map<String,String> or Map<String,Character>. This way you can map your 'characters', and as a neat side effect your code becomes more readable if you drop the escaped Unicode characters.

For more information, search for HighSurrogate or CodePoint. A code point is a single Unicode character, and, as mentioned before, the number of code points need not correspond to the number of chars in a String.

This is necessary because a Java char is just 2 bytes wide: too small for all Unicode characters, but big enough most of the time (i.e. as long as you use standard Latin characters).
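A short sketch of the difference between chars and code points, using only standard `String` methods; the class name `CodePointDemo` and the helper `codePoints` are hypothetical. It also shows that a combining accent is a separate case: there the base letter and the mark are two code points, not one code point split into a surrogate pair.

```java
public class CodePointDemo {

    // Counts Unicode code points rather than UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the 16-bit range and is stored as a surrogate pair
        String smiley = "\uD83D\uDE00";
        System.out.println(smiley.length());    // 2
        System.out.println(codePoints(smiley)); // 1

        // Base letter plus combining mark: two chars and also two code points
        String accented = "a\u0300";
        System.out.println(accented.length());    // 2
        System.out.println(codePoints(accented)); // 2
    }
}
```

So iterating by code point instead of by char would not, on its own, make a decomposed accented letter match a single map key.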

Edit:

Even with a Map<String,String>, your code won't work, because you still loop over chars, and no single Java char will match your special Unicode character.

This might help, though it may not work under all circumstances (Java strings are nasty after all):

import java.util.HashMap;
import java.util.Map;

Map<String, String> characterMappings = new HashMap<>();
characterMappings.put("а̀", "a");

String token = "nа̀ра";

// replace() matches each key literally; replaceAll() would treat it as a regex
for (Map.Entry<String, String> e : characterMappings.entrySet()) {
    token = token.replace(e.getKey(), e.getValue());
}
System.out.println(token);

Edit 2

Since posting code as a comment sucks:

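    // needs: import java.text.Normalizer; import java.nio.charset.StandardCharsets;
    String s = "brûlée";
    String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // StandardCharsets.US_ASCII avoids the checked UnsupportedEncodingException
    String s2 = new String(s1.replaceAll(regex, "").getBytes(StandardCharsets.US_ASCII),
            StandardCharsets.US_ASCII);

    System.out.println(s2);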

This works for me with everything I have tried so far. Still, @Scheintod deserves the credit. Source found here

Best regards

sam

samjaf
  • Well, if it's not matching the characters in the hashmap, will it match a string? – AnkitSablok Sep 27 '13 at 17:10
  • Please keep in mind that the Stack Overflow code section doesn't handle accents nicely either. Please adapt the example to your case. It looks different in my Eclipse, because the accent is over "a", not over p. The `put` statement is malformed in the answer as well. – samjaf Sep 27 '13 at 17:37
  • @Scheintod has a far superior solution. You should not use my naive implementation. – samjaf Sep 27 '13 at 17:41