Java library for text normalization

Question

I'm looking for a Java library that allows "normalization" of text: something similar to the standard Normalizer, but broader (something like utf8proc's LUMP option).

It should replace all kinds of special characters with their ASCII equivalents (where possible, of course): all variants of space with code 32, all variants of minus and dash (long, short, thin, etc.) with code 45, and so on.

PS: Looks like I'll have to implement it myself. Any ideas on how to do it? — valodzka, Nov 08 '10 at 10:36

score 4 · Answer 1 · edited May 23 '17 at 12:33

Your specific requirements are a bit vague, but I suppose you want something that does what Normalizer does, with the added ability to lump certain Unicode code points together into one character, similar to utf8proc.

I would go for a 2-step approach:

1. First, use Normalizer.normalize to produce whatever (de-)composition you want.
2. Then iterate through the code points of the result and unify the characters the way you like.

Both should be straightforward. For step 2, if you are dealing with characters outside the Basic Multilingual Plane (BMP), iterate by code point (for example with String.codePointAt and Character.charCount) rather than by char. If you are using only BMP code points, you can simply iterate over the chars.
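The two steps can be sketched with the JDK alone. This is a minimal sketch, not the answer's own code: the class name CodePointWalk is hypothetical, and NFKC is an assumed choice of form (the answer leaves the form open). NFKC already folds some compatibility characters, such as the no-break space, but leaves others, such as the en dash, untouched, which is why the second step is needed.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class CodePointWalk {
    // Step 1: compatibility decomposition followed by composition (NFKC).
    // The choice of NFKC is an assumption; pick the form you need.
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    // Step 2 skeleton: visit each code point, not each char, so that
    // supplementary characters (surrogate pairs) are handled as one unit.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            count++;
        }
        return count;
    }
}
```

Replacing the counter in the loop body with a lookup in a substitution table gives the lumping step described next.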

For the characters you would like to lump together, create a substitution data structure mapping each non-unified code point to its unified code point. Map<Character, Character> or Map<Integer, Integer> come to mind for that; note that the Character variant can only hold BMP characters. Populate the substitution map to your liking, e.g. by taking the information from utf8proc's lump.txt and a source for character categories.

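import java.util.HashMap;
import java.util.Map;

// Substitution table: non-unified character -> unified ASCII character.
// These declarations belong inside a class.
static final Map<Character, Character> LUMP;

static {
  LUMP = new HashMap<Character, Character>();
  LUMP.put('\u2216', '\\'); // SET MINUS -> backslash
  LUMP.put('\u2223', '|');  // DIVIDES -> vertical line (U+007C is '|' itself)
  // ...
}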
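As for "a source for character categories", the JDK's own Character.getType is one option. The sketch below is an assumption about how one might seed the table, not part of utf8proc or of the answer: the class name CategoryLump is hypothetical, and it maps every non-ASCII space separator (Zs) to code 32 and every non-ASCII dash (Pd) to code 45.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class CategoryLump {
    // Builds a code point -> code point table from Unicode general categories.
    static Map<Integer, Integer> build() {
        Map<Integer, Integer> lump = new HashMap<>();
        // Start above ASCII so that ' ' and '-' are not mapped to themselves.
        for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
            switch (Character.getType(cp)) {
                case Character.SPACE_SEPARATOR:  // category Zs
                    lump.put(cp, (int) ' ');
                    break;
                case Character.DASH_PUNCTUATION: // category Pd
                    lump.put(cp, (int) '-');
                    break;
                default:
                    break;
            }
        }
        return lump;
    }
}
```

Categories alone do not cover everything: MINUS SIGN (U+2212), for instance, is a math symbol rather than dash punctuation, so entries like that still have to be added explicitly.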

Create a new StringBuilder (or something similar) with an initial capacity equal to the length of your normalized string. When iterating over the code points, check whether LUMP.get(codePoint) is non-null. If it is, append the returned value; otherwise append the original code point. That should be it.
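Put together, the whole approach might look like the following. It is a minimal sketch under stated assumptions: the class name TextLumper, the choice of NFKC, and the handful of table entries are illustrative only, and the table uses Map<Integer, Integer> so that it is not limited to BMP characters.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, for illustration only.
final class TextLumper {
    // Illustrative entries; a real table would be much larger.
    private static final Map<Integer, Integer> LUMP = new HashMap<>();
    static {
        LUMP.put(0x2216, (int) '\\'); // SET MINUS
        LUMP.put(0x2223, (int) '|');  // DIVIDES
        LUMP.put(0x2013, (int) '-');  // EN DASH
        LUMP.put(0x2014, (int) '-');  // EM DASH
        LUMP.put(0x2212, (int) '-');  // MINUS SIGN
        LUMP.put(0x2003, (int) ' ');  // EM SPACE
    }

    static String lump(String input) {
        // Step 1: normalize (NFKC is an assumed choice).
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
        // Step 2: substitute code point by code point.
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); ) {
            int cp = normalized.codePointAt(i);
            i += Character.charCount(cp);
            Integer replacement = LUMP.get(cp);
            // Append the unified code point if one exists, else the original.
            out.appendCodePoint(replacement != null ? replacement : cp);
        }
        return out.toString();
    }
}
```

With these entries, TextLumper.lump("a\u2014b") yields "a-b", and characters without an entry, including supplementary ones, pass through unchanged.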

If required, you can support a way of loading the contents of LUMP from a configuration, e.g. from a Properties object.
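Loading the table from a Properties object could be sketched as below. The file format is an assumption made up for this example (hexadecimal source code point as the key, hexadecimal target code point as the value, e.g. 2014=002D), and the class name LumpConfig is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical class, for illustration only.
final class LumpConfig {
    // Assumed format: "<hex source code point>=<hex target code point>".
    static Map<Integer, Integer> fromProperties(Properties props) {
        Map<Integer, Integer> lump = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            int from = Integer.parseInt(key.trim(), 16);
            int to = Integer.parseInt(props.getProperty(key).trim(), 16);
            lump.put(from, to);
        }
        return lump;
    }

    // Convenience wrapper that parses properties text held in a String.
    static Map<Integer, Integer> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fromProperties(props);
    }
}
```

In a real application the Properties would more likely be loaded from a file or classpath resource; a malformed entry makes Integer.parseInt throw NumberFormatException, which this sketch does not handle.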

score 2 · Accepted Answer · answered Nov 09 '10 at 15:43

You should look at the Latin-ASCII transform in CLDR. It will be in ICU 4.6.

Steven R. Loomis

Thank you, looks like a good solution – valodzka Nov 09 '10 at 16:39

The Latin-ASCII transliterator went into ICU 4.6 / CLDR 1.9. – Steven R. Loomis Jul 29 '11 at 00:10

score 1 · Answer 3 · answered Nov 05 '10 at 22:55

Have you looked into icu4j's Normalizer?

normalize transforms Unicode text into an equivalent composed or decomposed form, allowing for easier sorting and searching of text. normalize supports the standard normalization forms described in Unicode Standard Annex #15 — Unicode Normalization Forms.

Robert Munteanu

Yes, I checked it. By default it doesn't do what I need. I've looked at Normalizer2 (http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer2.html); it can be configured, but that isn't a simple task. – valodzka Nov 05 '10 at 23:04
