11

I'm looking for a Java library that performs "normalization" of text: something similar to the standard Normalizer, but broader (something like utf8proc's LUMP option).

It should replace all kinds of special characters with their ASCII equivalents (where possible, of course): all variants of space with code 32, all variants of minus and dash (long, short, thin, etc.) with code 45, and so on.

valodzka

3 Answers

4

Your specific requirements are a bit vague, but I suppose you want something that does what Normalizer does, plus the ability to lump certain Unicode code points together into one character, similar to utf8proc.

I would go for a 2-step approach:

  1. First use Normalizer.normalize to create whatever (de-)composition you want
  2. Then iterate through the code points of the result and unify the characters the way you like.

Both should be straightforward. For step 2, if you are dealing with characters outside the Basic Multilingual Plane (BMP), iterate by code point rather than by char, because such characters occupy two chars (a surrogate pair) in a Java String. If you are using only BMP code points, you can simply iterate over the chars.
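Iterating by code point could look like the minimal sketch below. The class and method names (CodePointWalk, codePoints) are made up for illustration; only String.codePointAt and Character.charCount are standard API.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Collects the code points of a string, treating each surrogate pair as one unit.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads one or two chars
            out.add(cp);
            i += Character.charCount(cp);     // advance by 1 (BMP) or 2 (supplementary)
        }
        return out;
    }
}
```

A plain loop over charAt(i) would instead see the two halves of a surrogate pair as separate values, which is why it is only safe for BMP-only input.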

For the characters you would like to lump together, create a substitution data structure for the mapping ununified code point -> unified code point. Map<Character, Character> or Map<Integer, Integer> come to mind for that. Populate the substitution map to your liking, e.g. by taking the information from utf8proc's lump.txt and a source for character categories.

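import java.util.HashMap;
import java.util.Map;

// Inside your normalizer class; the field must be static to be filled from a static block.
static final Map<Character, Character> LUMP = new HashMap<Character, Character>();

static {
  LUMP.put('\u2216', '\\'); // set minus -> backslash
  LUMP.put('\u2223', '|'); // divides -> vertical bar
  // ...
}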

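Create a new StringBuilder (or something similar) with an initial capacity equal to the length of your normalized string. While iterating over the code points, look each one up in LUMP. If the lookup returns a non-null value, append that value; otherwise append the original code point. Make sure the lookup key has the same type as the map's keys: an int code point is boxed to an Integer and will never match a Character key. That should be it.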
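Putting the two steps together, a minimal sketch might look as follows. The class name LumpNormalizer, the choice of NFKC, and the handful of table entries are assumptions made for illustration; a real table would be much larger and could be derived from utf8proc's lump.txt.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class LumpNormalizer {
    // Illustrative subset only (assumption); extend to match your needs.
    private static final Map<Integer, Integer> LUMP = new HashMap<>();
    static {
        LUMP.put(0x00A0, (int) ' ');  // NO-BREAK SPACE
        LUMP.put(0x2003, (int) ' ');  // EM SPACE
        LUMP.put(0x2009, (int) ' ');  // THIN SPACE
        LUMP.put(0x2010, (int) '-');  // HYPHEN
        LUMP.put(0x2013, (int) '-');  // EN DASH
        LUMP.put(0x2014, (int) '-');  // EM DASH
        LUMP.put(0x2212, (int) '-');  // MINUS SIGN
        LUMP.put(0x2216, (int) '\\'); // SET MINUS
        LUMP.put(0x2223, (int) '|');  // DIVIDES
    }

    static String normalize(String input) {
        // Step 1: standard Unicode normalization (NFKC is an assumed choice).
        String s = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // Step 2: walk the code points and substitute those found in the table.
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            Integer mapped = LUMP.get(cp);
            sb.appendCodePoint(mapped != null ? mapped : cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Keying the map by Integer rather than Character keeps the lookup correct for code points outside the BMP. Note that this sketch leaves characters without a table entry, such as accented letters, untouched; stripping those to ASCII would need a decomposing form plus further rules.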

If required, you can support a way of loading the contents of LUMP from a configuration, e.g. from a Properties object.
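One way to do that is sketched below. The entry format (source code point in hex as the key, target code point in hex as the value, e.g. 2013=002D) and the names LumpConfig and fromProperties are assumptions for illustration, not an established convention.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class LumpConfig {
    // Builds a substitution map from entries such as "2013=002D" (assumed hex format).
    static Map<Integer, Integer> fromProperties(Properties props) {
        Map<Integer, Integer> lump = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            int from = Integer.parseInt(key.trim(), 16);
            int to = Integer.parseInt(props.getProperty(key).trim(), 16);
            lump.put(from, to);
        }
        return lump;
    }
}
```

The Properties object itself can be filled with Properties.load from a file or classpath resource; malformed entries will surface as a NumberFormatException in this sketch.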

Community
nd.
2

You should look at the Latin-ASCII transform in CLDR. It will be in ICU 4.6.

Steven R. Loomis
1

Have you looked into icu4j's Normalizer?

normalize transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. normalize supports the standard normalization forms described in Unicode Standard Annex #15 — Unicode Normalization Forms.

Robert Munteanu
  • Yes, I checked it. By default it doesn't do what I need. I've looked at Normalizer2 (http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer2.html); it can be configured, but that isn't a simple task. – valodzka Nov 05 '10 at 23:04