How to compare Bengali homoglyph words or characters in Python?

Question

s1='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09A1%u09BC%u09BF
s2='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09DC%u09BF

They look the same but are different. How can I treat them as the same string?

Have you taken a look at the [homoglyphs](https://pypi.org/project/homoglyphs/) library? – ndclt, Sep 18 '22 at 13:39

score 0 · Accepted Answer · answered Sep 18 '22 at 15:31

In Unicode, the character U+09DC is canonically equivalent to the sequence U+09A1 U+09BC. When you compare Unicode strings, you should always use Unicode normalization to fold together canonically equivalent sequences. So, convert both strings to Unicode normalization form C or Unicode normalization form D before comparing.

See UAX #15 Unicode Normalization Forms for details on Unicode normalization.

See this answer for how to normalize Unicode strings in Python.
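As a minimal sketch of the comparison described above, using the standard library's `unicodedata.normalize`: the helper name `canonically_equal` is hypothetical (not from the original answer), and NFD is used as the default form here.

```python
import unicodedata


def canonically_equal(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two strings after applying the same Unicode normalization form.

    Hypothetical helper for illustration; `form` is passed straight to
    unicodedata.normalize.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings from the question, written with escapes so the
# difference is visible in the source.
s1 = "\u09ab\u099f\u09bf\u0995\u099b\u09a1\u09bc\u09bf"  # U+09A1 followed by U+09BC
s2 = "\u09ab\u099f\u09bf\u0995\u099b\u09dc\u09bf"        # single code point U+09DC

print(s1 == s2)                   # False: the raw code point sequences differ
print(canonically_equal(s1, s2))  # True: both normalize to the same sequence
```

If the strings are compared often, normalizing them once when they are read in and storing the normalized form avoids repeating the work on every comparison.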

answered Sep 18 '22 at 15:31

Peter Constable


FYI, only the normalization to the decomposed form works. U+09A1 U+09BC does not compose to U+09DC, but U+09DC decomposes to U+09A1 U+09BC. – Mark Tolonen Sep 18 '22 at 16:41
@MarkTolonen Ah, yes! In most cases, either NFC or NFD would work but there are a small number of exceptions, and this is one of them. – Peter Constable Sep 19 '22 at 16:15
