How to sort and search in text while ignoring diacritics of all kinds?

Question

Background

Various languages have what are called "diacritics": special signs attached to "normal" letters in one way or another. They might change how the letters sound, or just hint at how they are supposed to sound.

The problem

When strings are searched and sorted the basic way, the comparison uses the Unicode values of the characters, so items can appear in the wrong order when sorting, or fail to be found when searching.

Searching should allow me to find the occurrences of a string within another, including not just that they exist, but also where.

If I take the French string "Le Garçon", for example, and search for "rc", the match should start at the position of "r" and end at the position of "ç". Finding the locations is important if you want to highlight where the text was found.

What I've found

Collator and CollationKey can help for sorting: https://stackoverflow.com/a/75334111/878126

Normalizer might help for searching as it replaces letters that have Diacritic: https://stackoverflow.com/a/10700023/878126

But these don't seem to cover some languages. I know Hebrew, for example, which has Niqqud signs (vowel marks, roughly equivalent to vowels in English, but optional). As Unicode characters, they are added after the letter, even though the sign itself is displayed inside or around the letter.

https://en.wikipedia.org/wiki/Diacritic#Hebrew

In this case, normalizing the word doesn't do anything, so searching and sorting the text becomes a problem.

Example is:

import java.text.Normalizer
import java.util.regex.Pattern

val regex = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+").toRegex()
val string = "בְּרֵאשִׁית"
val length = string.length // 11, not the 6 letters that appear on screen
val normalized = Normalizer.normalize(string, Normalizer.Form.NFD)
val result = normalized.replace(regex, "") // still identical to the original instead of "בראשית"

I was told (here) that perhaps ICU4J library could help with these 2 operations (search and sort), but I can't find this information.

The questions

Is there a better solution in the Java/Kotlin API for searching and sorting while ignoring diacritics? One that covers as many languages as possible?

Can ICU4J help? If so, how? I couldn't find much information or many samples about how to use it for this purpose in Java/Kotlin.

Is Hebrew the main concern you have? That is, you either are working only with Hebrew, or you are satisfied with your solution for other languages? Do you receive the text you are sorting and searching from some other source that consistently normalizes it to some form? For example, maybe your target text is stripped of niqqud and matres lectionis and consists only of consonants? (I am not familiar with Hebrew, just reading Wikipedia.) — erickson, Feb 16 '23 at 22:32
Hebrew and English are the only languages I know (I know a tiny bit of French, but it doesn't count). As for normalizing Hebrew, I wrote that what I've found doesn't do anything to the input: it ends up exactly as it started. My guess is that, since Hebrew is similar in some ways to Arabic and Persian, the issue probably exists there too. In Hebrew, the Niqqud signs are very much optional, because people are familiar with the same words without them. If you check Hebrew text on the Internet, you will almost never see Niqqud. In Bible text you will see it, because that text is hard to read without it. — android developer, Feb 16 '23 at 22:51

score 0 · Answer 1 · answered Feb 16 '23 at 20:50


Try this. It will normalize your string for search:

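    // requires: import java.text.Normalizer;
    String s = "çéèïïÔé";
    // Decompose accented letters (NFD), then drop the combining marks.
    s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    System.out.println(s); // ceeiiOe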


Philippe Fery


This is similar to what I wrote that I tried. It works for the given input you've chosen, but not for the example I wrote ("בְּרֵאשִׁית" should become "בראשית") – android developer Feb 16 '23 at 21:44

erickson · Answer 2 · 2023-02-19T01:07:00.377

In the case of your example, בְּרֵאשִׁית, the "diacritics" are not in the Combining Diacritical Marks block that your regex targets (that block covers only U+0300 to U+036F). They are Hebrew-block characters in the general category "non-spacing mark," Mn.

This regex satisfies your test: [\\p{IsHebrew}&&\\p{IsMn}]. I don't know Hebrew script, so I can't tell whether it causes problems elsewhere or misses some other elements of the script.

Here is a test demonstrating [\\p{IsHebrew}&&\\p{IsMn}]:

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

public class SO75476483 {

    @Test
    public void inTheBeginning() {
        var niqqud = "[\\p{IsHebrew}&&\\p{IsMn}]";
        var text = "בְּרֵאשִׁית";
        int length = text.length();
        Assertions.assertEquals(11, length);
        String actual = text.replaceAll(niqqud, "");
        Assertions.assertEquals("בראשית", actual);
    }

}
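
If precomposed letters such as ç need to be handled alongside the Hebrew points, one possible generalization is to decompose with Normalizer first and then drop every code point in category Mn. This is only a sketch: the class and method names are hypothetical, and removing all Mn marks is more aggressive than the Hebrew-only pattern above, since it also strips marks in scripts where they carry meaning (some Devanagari vowel signs, for instance).

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class MarkStripper {
    // \p{Mn} is the Unicode general category "Mark, nonspacing".
    private static final Pattern NONSPACING_MARKS = Pattern.compile("\\p{Mn}+");

    /** Decomposes precomposed letters (NFD), then removes every nonspacing mark. */
    public static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return NONSPACING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // French sample with a precomposed c-cedilla (U+00E7).
        System.out.println(stripMarks("Le Gar\u00E7on"));
        // The Hebrew sample word from the question, written with escapes.
        System.out.println(stripMarks(
            "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u05D9\u05EA"));
    }
}
```

With this, "Le Garçon" comes out as "Le Garcon" and "בְּרֵאשִׁית" as "בראשית".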

Equivalence and sorting rules for the same characters are different in different locales. It follows ineluctably that you must select a specific locale appropriate for each use. There are no universal rules that work for everyone.

For search applications, you'll segregate documents by language and build a separate index for each group, using the language-appropriate collator. When making a query, the user will provide a keyword and its language tag (though the language is likely to be implied, for example via the Accept-Language header in an HTTP request). The language is used to select an appropriate collator and an index to search with the resulting collation key.

Here is a test demonstrating the right way to approach this problem (in memory), with a Collator.

    @Test
    public void collateInTheBeginning() {
        var hebrewCollator = Collator.getInstance(Locale.forLanguageTag("he"));
        hebrewCollator.setStrength(Collator.PRIMARY);

        var hebrewIndex = new HashMap<CollationKey, String>();
        var document = "בְּרֵאשִׁית";
        var ref = "Gen. 1:1";
        hebrewIndex.put(hebrewCollator.getCollationKey(document), ref);

        var query = "בראשית";
        String actual = hebrewIndex.get(hebrewCollator.getCollationKey(query));

        Assertions.assertEquals(ref, actual);
    }
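
A collator configured this way can also be passed directly as a Comparator when sorting. The following is a minimal sketch using only java.text; the class name, the method name, and the sample words are hypothetical, and Locale.FRENCH is just an example locale.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class MarkInsensitiveSort {

    /** Returns a sorted copy; strings differing only by accents or case compare as equal. */
    public static List<String> sorted(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore accent and case differences
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        List<String> copy = new ArrayList<>(input);
        copy.sort(collator); // List.sort is stable, so equal items keep their input order
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of("z\u00E8bre", "\u00C9clair", "eclair", "abricot");
        System.out.println(sorted(words, Locale.FRENCH));
    }
}
```

For large lists, caching a CollationKey per string, as the question's linked answer does, avoids recomputing the comparison data on every compare call.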

Of course, many applications have too much text to keep all of this in memory as CollationKey instances. Most relational databases support collations internally, if the proper one is specified when a column is defined, and a decent full-text search engine will provide equivalent capabilities.

In the worst case, a CollationKey can be converted to a byte array in the application, and used as a key for searching, range queries, and sorting in nearly any type of external database.
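
As a rough illustration of that last option, CollationKey.toByteArray() gives the bytes to store. The sketch below assumes the external store compares byte strings lexicographically as unsigned values; the class and method names are hypothetical, and Arrays.compareUnsigned requires Java 9 or later.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class SortKeyBytes {

    /** Bytes suitable for storing as a sort or lookup key in an external database. */
    public static byte[] sortKey(Collator collator, String text) {
        return collator.getCollationKey(text).toByteArray();
    }

    /** Mimics a database that orders byte strings as unsigned, most significant byte first. */
    public static int compareKeys(byte[] left, byte[] right) {
        return Arrays.compareUnsigned(left, right);
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY);

        byte[] plain = sortKey(collator, "cote");
        byte[] accented = sortKey(collator, "c\u00F4te"); // o with circumflex
        System.out.println(Arrays.equals(plain, accented));
    }
}
```

Keys are only comparable when they come from collators with the same locale and strength, so the stored bytes would need to be regenerated if either setting changes.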

While Arabic and Hebrew are abjads, where vowels can be inferred, you should be aware that this is not representative. Abugidas like Devanagari are commonly used, and stripping vowel marks from these scripts would make text illegible.

Decomposing characters with a Normalizer will allow you to remove non-spacing marks, but to be safe, you would need to limit this behavior to abjads (mostly scripts coming down from Samaritan or Aramaic).

On the other hand, a Collator set with the proper language and a PRIMARY strength will handle this distinction for you, and ignore marks that don't matter.
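
Neither approach by itself reports where a match occurs, which the question needs for highlighting. One way to get positions, sketched here under the assumption that stripping non-spacing marks is acceptable for the scripts involved, is to build the stripped text while remembering which original index each remaining character came from. The class and method names are hypothetical, and the search is case-sensitive.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class MarkInsensitiveSearch {

    /** Returns {start, end} pairs (end exclusive) that index into the ORIGINAL text. */
    public static List<int[]> findAll(String text, String query) {
        StringBuilder folded = new StringBuilder();
        List<Integer> origin = new ArrayList<>(); // origin.get(k): index in text behind folded char k
        fold(text, folded, origin);
        StringBuilder foldedQuery = new StringBuilder();
        fold(query, foldedQuery, new ArrayList<>());

        List<int[]> matches = new ArrayList<>();
        if (foldedQuery.length() == 0) return matches; // nothing to search for
        String needle = foldedQuery.toString();
        for (int at = folded.indexOf(needle); at >= 0; at = folded.indexOf(needle, at + 1)) {
            int start = origin.get(at);
            int last = origin.get(at + needle.length() - 1);
            int end = last + Character.charCount(text.codePointAt(last));
            // Include the marks attached to the last matched letter.
            while (end < text.length() && isMark(text.codePointAt(end))) {
                end += Character.charCount(text.codePointAt(end));
            }
            matches.add(new int[] {start, end});
        }
        return matches;
    }

    /** Appends s to out without non-spacing marks, recording each kept char's source index. */
    private static void fold(String s, StringBuilder out, List<Integer> origin) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String nfd = Normalizer.normalize(new String(Character.toChars(cp)), Normalizer.Form.NFD);
            for (int j = 0; j < nfd.length(); ) {
                int d = nfd.codePointAt(j);
                if (!isMark(d)) {
                    out.appendCodePoint(d);
                    for (int k = 0; k < Character.charCount(d); k++) origin.add(i);
                }
                j += Character.charCount(d);
            }
            i += Character.charCount(cp);
        }
    }

    private static boolean isMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }
}
```

For "Le Garçon" and the query "rc", findAll returns the single range 5 to 7, which covers "rç" in the original string. A collation-aware search would respect locale rules better than this character-level approach.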

Not sure what you wanted to put, so I tried both options: `Pattern.compile("[\\\\p{IsHebrew}&&\\\\p{IsMn}]").toRegex()` and `Pattern.compile("[\\p{IsHebrew}&&\\p{IsMn}]").toRegex()`. Neither did anything to the input. As I wrote, for the example of "בְּרֵאשִׁית" the result should be without the dots, meaning "בראשית". Also, Hebrew is only one example. I don't know other languages, but perhaps Arabic and Persian would have the same issue. — android developer, Feb 17 '23 at 00:00
Seems to work fine now. Why is there no usage of the Normalizer? Isn't it the official way to do it? And what about all other languages? I know only English and Hebrew... The question was about all languages (or at least most of them). I only noticed the issue in Hebrew because that's one of the languages I know. Also, I noticed a bug in the IDE regarding the regular expression, so I reported it here: https://youtrack.jetbrains.com/issue/KT-56752/Bug-IDE-tells-to-change-Regex-to-something-that-actually-breaks-how-it-works — android developer, Feb 17 '23 at 08:29
@androiddeveloper My approach is suitable for scripts with niqqud-like, non-spacing marks. My understanding is that these are customarily inferred by proficient readers of that language and script; in other words, people are used to reading text without these marks every day. You would know better if French is like that. If someone searches for "côte" and gets back hits for "coté", is that frustrating or just what they expect? Hebrew doesn't require decomposition to get rid of these marks, but other scripts will, because there are composed equivalent Unicode characters. — erickson, Feb 17 '23 at 16:42
I would strongly advise using language-specific processing to normalize your text for searching and sorting; trying to create one giant regex for everything is unwieldy to maintain and inefficient at runtime. You should localize your filters instead. I would also take a harder look at `CollationKey`, as this is exactly the problem it's designed to solve, where `Normalizer` is indirectly related. — erickson, Feb 17 '23 at 16:46
But these classes are already what I talked about, and you can see that they are not complete (they can't handle Hebrew). I'm searching for a solution that works for all or most languages. Maybe even a library, as it might get updated from time to time. — android developer, Feb 17 '23 at 20:06
@androiddeveloper "you can see that they are not complete" No, I don't see that. That's why I recommended that you look harder at them. I provided yet another example. Good day to you. — erickson, Feb 17 '23 at 23:42
Why can't I use a generic solution that's for all languages? Why should the user select a language? A search query can consist of characters from multiple languages. Isn't the Collator useful for sorting, while Normalizer is useful for searching? I don't understand what you did in the function collateInTheBeginning. If it's for sorting, you are supposed to store the original string mapped to the key to be able to have a nice caching, as I've demonstrated here: https://stackoverflow.com/a/75334111/878126 — android developer, Feb 18 '23 at 11:35
