I am using ICU4J and trying to merge two sets of transliteration rules. In the end result, all German umlaut characters should be converted to their DIN 5007-2 alternatives, and all other non-ASCII characters should be converted to their ASCII equivalents.
When I try it like this:
import com.ibm.icu.text.Transliterator;

public class Main
{
    public static void main(String[] args)
    {
        Transliterator latinASCII = Transliterator.getInstance("Latin-ASCII");
        String german_DIN_5007_2Rules = "$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:];\n" +
                "\\u00e4 > ae;\n" +
                "\\u00f6 > oe;\n" +
                "\\u00fc > ue;\n" +
                "\\u00c4 } $beforeLower > Ae;\n" +
                "\\u00d6 } $beforeLower > Oe;\n" +
                "\\u00dc } $beforeLower > Ue;\n" +
                "\\u00c4 > AE;\n" +
                "\\u00d6 > OE;\n" +
                "\\u00dc > UE;\n";
                //"\\u00df > ss;\n";
        String latinASCIIRules = latinASCII.toRules(true);
        String germanASCIIRules = latinASCIIRules + german_DIN_5007_2Rules;
        Transliterator germanASCII = Transliterator.createFromRules("german_DIN_5007_2", germanASCIIRules, Transliterator.FORWARD);
        String result1 = germanASCII.transliterate("Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß");
        String result2 = germanASCII.transliterate("Ç,ü,é,â,ä,à,ç,ê,ë,è,ï,î,ì,Ä,Å,É,æ,Æ,ô,ö,ò,û,ù,Ô,Û,Ã,ã,Ñ,Õ,õ,Ä,Ë,Ï,Ö,Ü,Ÿ,Ç,Œ,œ,ū,Ð,ð,Ċ,ċ,Ġ,ġ,ů,Ů,š,Š,Ě,ť,ž,Ć,Ł,Ó,Ź,ą,ę,ń,ś,ż,ÿ,Ö,Ü,á,í,ó,ú,ñ,Ñ,À,È,Ì,Ò,Ù,Á,É,Í,Ó,Ú,Ý,Â,Ê,Î,ß,Ø,ø,Å,å,Þ,þ,Ā,Ē,Ī,Ō,Ū,ā,ē,ī,ō,ě,Ů,ů,Č,č,Ď,ď,Ľ,ľ,Ň,ň,Ř,ř,Š,š,Ť,Ž,Ą,Ę,Ń,Ś,Ż,ć,ł,ó,ź, ,/");
        System.out.println(result1);
        System.out.println(result2);
    }
}
I get:
Hauser Baume Hofe Garten dass U u o a A O ss
C,u,e,a,a,a,c,e,e,e,i,i,i,A,A,E,ae,AE,o,o,o,u,u,O,U,A,a,N,O,o,A,E,I,O,U,Y,C,OE,oe,u,D,d,C,c,G,g,u,U,s,S,E,t,z,C,L,O,Z,a,e,n,s,z,y,O,U,a,i,o,u,n,N,A,E,I,O,U,A,E,I,O,U,Y,A,E,I,ss,O,o,A,a,TH,th,A,E,I,O,U,a,e,i,o,e,U,u,C,c,D,d,L,l,N,n,R,r,S,s,T,Z,A,E,N,S,Z,c,l,o,z, ,/
That is incorrect, because the German umlauts lose their diacritics instead of being converted as I expect:
ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
If I reverse the order of the two rule sets in germanASCIIRules like this:
String germanASCIIRules = german_DIN_5007_2Rules + latinASCIIRules;
I get:
Exception in thread "main" com.ibm.icu.impl.IllegalIcuArgumentException: Compound filters misplaced
at com.ibm.icu.text.TransliteratorParser.parseRules(TransliteratorParser.java:1101)
at com.ibm.icu.text.TransliteratorParser.parse(TransliteratorParser.java:867)
at com.ibm.icu.text.Transliterator.createFromRules(Transliterator.java:1413)
at com.stepstone.Main.main(Main.java:26)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
If I don't combine the rules and use only german_DIN_5007_2Rules like this:
Transliterator germanASCII = Transliterator.createFromRules("german_DIN_5007_2", german_DIN_5007_2Rules, Transliterator.FORWARD);
I get:
Haeuser Baeume Hoefe Gaerten daß UE ue oe ae AE OE ß
Ç,ue,é,â,ae,?,ç,?,ë,?,?,î,?,AE,?,É,?,?,ô,oe,?,?,?,Ô,?,?,?,?,?,?,AE,Ë,?,OE,UE,?,Ç,?,?,?,?,?,?,?,?,?,ů,Ů,š,Š,Ě,ť,ž,Ć,Ł,Ó,Ź,ą,ę,ń,ś,ż,?,OE,UE,á,í,ó,ú,?,?,?,?,?,?,?,Á,É,Í,Ó,Ú,Ý,Â,?,Î,ß,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,ě,Ů,ů,Č,č,Ď,ď,Ľ,ľ,Ň,ň,Ř,ř,Š,š,Ť,Ž,Ą,Ę,Ń,Ś,Ż,ć,ł,ó,ź, ,/
Here the umlauts are transliterated correctly, but all the remaining non-ASCII characters are either left unchanged or printed as ? :(
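To make the target behaviour concrete, here is a minimal plain-JDK sketch of the order of operations I am after: expand the umlauts first, then fold whatever diacritics remain. It deliberately does not use ICU4J; the class `GermanAsciiSketch` and its methods are hypothetical names made up for this illustration, and the second step uses `java.text.Normalizer` as a rough stand-in for Latin-ASCII. Unlike Latin-ASCII, it leaves characters without a canonical decomposition (ß, Æ, Ø, Ł and similar) untouched, and the "before lowercase" check is simplified to the immediately following character.

```java
import java.text.Normalizer;

// Hypothetical helper (plain JDK, no ICU4J): illustrates the intended order of
// operations only and is not a replacement for ICU4J's Latin-ASCII transform.
class GermanAsciiSketch {

    // Step 1: DIN 5007-2 style umlaut expansion, mirroring the rules in the question.
    static String expandUmlauts(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Simplification: only the directly following character is inspected,
            // whereas the ICU rule also skips intervening combining marks.
            boolean beforeLower = i + 1 < input.length()
                    && Character.isLowerCase(input.charAt(i + 1));
            switch (c) {
                case '\u00e4': out.append("ae"); break;
                case '\u00f6': out.append("oe"); break;
                case '\u00fc': out.append("ue"); break;
                case '\u00c4': out.append(beforeLower ? "Ae" : "AE"); break;
                case '\u00d6': out.append(beforeLower ? "Oe" : "OE"); break;
                case '\u00dc': out.append(beforeLower ? "Ue" : "UE"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Step 2: decompose to NFD and drop combining marks (e.g. e + acute -> e).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Umlauts must be expanded BEFORE the generic folding; otherwise the
    // diaeresis is stripped first and the umlaut rules never match.
    static String fold(String input) {
        // Compose first so that "a" + combining diaeresis is seen as one umlaut.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return stripMarks(expandUmlauts(composed));
    }

    public static void main(String[] args) {
        // prints: Haeuser Baeume Hoefe Gaerten UE ue
        System.out.println(fold("H\u00e4user B\u00e4ume H\u00f6fe G\u00e4rten \u00dc \u00fc"));
    }
}
```

The point of the sketch is the ordering: if the generic folding ran first, ä would already have become a by the time the umlaut rules are applied, which matches the first output above.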