Your requirements are a bit vague, but I assume you want something that does what Normalizer does, with the added ability to lump certain Unicode code points together into one character, similar to utf8proc.
I would go for a 2-step approach:
- First use Normalizer.normalize to create whatever (de-)composition you want
- Then iterate through the code points of the result and unify the characters the way you like.
Both should be straightforward. For the second step, if you are dealing with characters outside the Basic Multilingual Plane (BMP), iterate over code points rather than chars, e.g. with String.codePointAt and Character.charCount. If you only use BMP code points, you can simply iterate over the chars.
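To make the difference between chars and code points concrete, here is a minimal sketch. The class name CodePointDemo, its helper methods, and the sample strings are made up for illustration; NFC is just one possible choice of normalization form.

```java
import java.text.Normalizer;

public class CodePointDemo {
    // Step 1: normalize to NFC (pick whichever form suits your data).
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    // Counts code points; a surrogate pair counts as a single code point.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent composes to one char under NFC.
        String composed = normalize("e\u0301");
        System.out.println(composed.length());

        // U+1D11E (musical symbol G clef) lies outside the BMP:
        // two chars, but only one code point.
        String outsideBmp = new String(Character.toChars(0x1D11E));
        System.out.println(outsideBmp.length());
        System.out.println(countCodePoints(outsideBmp));
    }
}
```

If you never see characters outside the BMP, a plain loop over charAt(i) behaves the same as the code point loop.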
For the characters you would like to lump together, create a substitution data structure that maps each ununified code point to its unified code point. Map<Character, Character> (BMP only) or Map<Integer, Integer> (any code point) come to mind for that. Populate the substitution map to your liking, e.g. by taking the information from utf8proc's lump.txt and a source for character categories.
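import java.util.HashMap;
import java.util.Map;

// Maps each character to be lumped to its replacement.
static final Map<Character, Character> LUMP = new HashMap<>();
static {
    LUMP.put('\u2216', '\\'); // set minus -> backslash
    LUMP.put('\u2223', '|');  // divides -> vertical line
    // ...
}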
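Create a StringBuilder (or something similar) with an initial capacity equal to the length of your normalized string. When iterating over the code points, check whether LUMP.get(codePoint) is non-null. If it is, append the value returned; otherwise append the original code point. Make sure the lookup key has the map's key type: with a Map<Character, Character>, an int argument is boxed to an Integer and never matches. That should be it.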
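Putting the two steps together could look roughly like the sketch below. It uses the Map<Integer, Integer> variant so that code points outside the BMP also work. The class name Lumper and the method normalizeAndLump are made-up names, NFC is an arbitrary choice of form, and the two map entries are only examples.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class Lumper {
    // Example substitutions only; fill this with whatever you want to lump.
    private static final Map<Integer, Integer> LUMP = new HashMap<>();
    static {
        LUMP.put(0x2216, (int) '\\'); // set minus -> backslash
        LUMP.put(0x2223, (int) '|');  // divides -> vertical line
    }

    static String normalizeAndLump(String input) {
        // Step 1: (de-)compose with the standard Normalizer.
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Step 2: walk the code points and substitute where a mapping exists.
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); ) {
            int cp = normalized.codePointAt(i);
            i += Character.charCount(cp);
            Integer replacement = LUMP.get(cp); // null if cp is not lumped
            out.appendCodePoint(replacement != null ? replacement : cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeAndLump("a\u2216b\u2223c"));
    }
}
```

Code points without an entry in the map, including those outside the BMP, are copied through unchanged.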
If required, you can support a way of loading the contents of LUMP from a configuration, e.g. from a Properties object.
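One possible way to do that is sketched below. The file format is purely an assumption for illustration: each key and value is a hexadecimal code point, e.g. 2216=005C. The class name LumpConfig and the method fromProperties are made up as well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class LumpConfig {
    // Assumed format: "<hex code point>=<hex replacement code point>".
    static Map<Integer, Integer> fromProperties(Properties props) {
        Map<Integer, Integer> lump = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            int from = Integer.parseInt(key.trim(), 16);
            int to = Integer.parseInt(props.getProperty(key).trim(), 16);
            lump.put(from, to);
        }
        return lump;
    }

    public static void main(String[] args) throws IOException {
        // In practice you would load from a file or classpath resource instead.
        Properties props = new Properties();
        props.load(new StringReader("2216=005C\n2223=007C\n"));
        Map<Integer, Integer> lump = fromProperties(props);
        System.out.println(Integer.toHexString(lump.get(0x2216)));
    }
}
```

A malformed entry makes Integer.parseInt throw a NumberFormatException, so you may want to catch that and report the offending key.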