s1='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09A1%u09BC%u09BF
s2='ফটিকছড়ি' #escape-unicode= %u09AB%u099F%u09BF%u0995%u099B%u09DC%u09BF

They look the same but compare as different strings. How can I treat them as the same string?

mkrieger1
Rafiqul Islam
  • Have you taken a look at the [homoglyphs](https://pypi.org/project/homoglyphs/) library? – ndclt Sep 18 '22 at 13:39

1 Answer


In Unicode, the character U+09DC is canonically equivalent to the sequence U+09A1 U+09BC. When you compare Unicode strings, you should always use Unicode normalization to fold together canonically equivalent sequences. So, convert both strings to Unicode normalization form C or Unicode normalization form D before comparing.

See UAX #15 Unicode Normalization Forms for details on Unicode normalization.

See this answer for how to normalize Unicode strings in Python.
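
A minimal sketch of that comparison, assuming Python's standard `unicodedata` module. The helper name `same_text` is hypothetical, and the two strings are the ones from the question, written with escapes so the differing code points are visible:

```python
import unicodedata

# s1 spells the letter as U+09A1 followed by U+09BC (nukta);
# s2 uses the single code point U+09DC.
s1 = "\u09ab\u099f\u09bf\u0995\u099b\u09a1\u09bc\u09bf"
s2 = "\u09ab\u099f\u09bf\u0995\u099b\u09dc\u09bf"


def same_text(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(s1 == s2)           # False: the raw code point sequences differ
print(same_text(s1, s2))  # True: U+09DC decomposes to U+09A1 U+09BC
```

Normalizing both sides inside the helper keeps the original strings untouched; if many comparisons are needed, it may be cheaper to normalize once when the text is read in.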

Peter Constable
  • FYI, only the normalization to the decomposed form works. U+09A1 U+09BC does not compose to U+09DC, but U+09DC decomposes to U+09A1 U+09BC. – Mark Tolonen Sep 18 '22 at 16:41
  • @MarkTolonen Ah, yes! In most cases, either NFC or NFD would work but there are a small number of exceptions, and this is one of them. – Peter Constable Sep 19 '22 at 16:15