Java string searching ignoring accents - part II

Question

This question is a continuation of Java string searching ignoring accents.

The answer to the original question shows us how to remove the diacritics from strings. So, for instance, köln becomes koln. But łódź becomes łodz - note the l with stroke.
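For reference, the decomposition approach presumably looks something like the sketch below (the class and method names are made up for illustration; the linked answer may differ in detail). It decomposes to NFD and drops the combining marks, which is why `ł` survives: it has no canonical decomposition into `l` plus a mark.

```java
import java.text.Normalizer;

// Hypothetical helper class, named for illustration only.
final class DiacriticStripper {

    // Decompose to NFD, then remove all combining marks (Unicode category M).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the result does not depend on source encoding.
        System.out.println(stripMarks("k\u00f6ln"));              // köln -> koln
        System.out.println(stripMarks("\u0142\u00f3d\u017a"));    // łódź -> łodz (stroke kept)
    }
}
```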

My question is how can I remove the stroke as well, so that łódź becomes lodz?

Thanks.

You were given the wrong answer. See my comment below. – tchrist Jun 03 '12 at 04:55

Joey · Accepted Answer · score 2 · 2012-05-30T07:54:54.497

You cannot, at least not trivially for all such letters. The letter ł is (except for appearance and its Unicode name) not linked to l at all (in Unicode at least; linguistically that's a different matter).

Your only option might be a conversion table for your use case you can fill with all the characters you need to convert.
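A minimal sketch of such a table, assuming Java 9+ for `Map.of`. The class name, method name and the handful of entries are my own illustration and deliberately incomplete; extend the table with whatever your data actually contains. Mapped characters are replaced first, then the usual NFD decomposition removes the remaining marks.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper class, named for illustration only.
final class StrokeFolder {

    // Illustrative, incomplete table of letters that have no canonical decomposition.
    private static final Map<Character, String> TABLE = Map.of(
            '\u0142', "l",   // ł
            '\u0141', "L",   // Ł
            '\u00f8', "o",   // ø
            '\u00d8', "O",   // Ø
            '\u0111', "d",   // đ
            '\u0110', "D");  // Đ

    static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = TABLE.get(c);
            sb.append(mapped != null ? mapped : String.valueOf(c));
        }
        // Strip whatever combining marks remain after decomposition.
        String decomposed = Normalizer.normalize(sb, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0142\u00f3d\u017a")); // łódź -> lodz
    }
}
```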

edited May 30 '12 at 07:54

answered May 30 '12 at 07:48


**This answer is incorrect!!** According to the current DUCET used by the Unicode Collation Algorithm, the primary collation strength for U+0142 `LATIN SMALL LETTER L WITH STROKE` (that `ł` character) is identical to that of a normal `LATIN SMALL LETTER L`. The correct answer is to compare strings using the Unicode Collation Algorithm but with the strength set to primary (level one) only. You will probably have to use ICU if you’re stuck with Java, because the Sun libraries do not correctly implement the UCA. – tchrist Jun 03 '12 at 04:54
Admittedly, I didn't look at what they actually want to do and took this question as »How can I create a new string where `ł` gets converted to `l`?« I guess that *would* be difficult using a collation algorithm (barring enumeration of all possible strings). So I was mainly looking at decomposition. I can't delete this answer until it is unaccepted, though. – Joey Jun 03 '12 at 07:01
I want the unaccented string indeed. – mark Jun 03 '12 at 18:15
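For anyone who only needs to compare or search rather than produce a new string, a primary-strength comparison might look like the sketch below. It uses the JDK's own `java.text.Collator`, which, as noted in the comment above, is not a conforming UCA implementation; ICU4J's `com.ibm.icu.text.Collator` has a similarly shaped `getInstance`/`setStrength` API if you need one. The class and method names are hypothetical. Whether `ł` and `l` compare equal at primary strength depends on the collation data in use, so the sketch prints that result rather than assuming it.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, named for illustration only.
final class PrimaryStrengthCompare {

    // True if the two strings differ at most by accents or case.
    static boolean samePrimary(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(samePrimary("k\u00f6ln", "koln"));   // köln vs koln
        // Result depends on the collation tables; check it for your runtime.
        System.out.println(samePrimary("\u0142\u00f3d\u017a", "lodz"));
    }
}
```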

score 1 · Answer 2 · answered Nov 22 '12 at 06:57

As tchrist suggested, I tried ICU (version 50.1); it did not treat ł as derived from L either. The L with stroke seems to be a special case in Unicode. See http://bugs.mysql.com/bug.php?id=11369: according to that report, it was not connected to L in Unicode 4.0, while in Unicode 4.1 it is. I wonder whether anyone has tested the problem with a Unicode 4.1-based Java library.
