
I am trying to convert Arabic characters to their Unicode code points, but for what looks like the same string I am getting different code points.

Here is "hello" in Arabic, translated using Google Translate: مرحبا The Unicode escape for that is: "%u0645%u0631%u062D%u0628%u0627"

Here is "hello" in Arabic, translated using deep translator in Python: ﻣﺮﺣﺒﺎ The Unicode escape for that is: "%uFEE3%uFEAE%uFEA3%uFE92%uFE8E"

Why am I getting different code points for the same thing? The two strings render in a different font here, but if you convert these escapes back to text you get the same word, as shown in the image below.

I want every code point in the result to start with "%u06" only, not "%uFE".

enter image description here

derikS4M1
  • Those letters are `ﻣ` (U+FEE3, *Arabic Letter Meem Initial Form*) and `م` (U+0645, *Arabic Letter Meem*). – JosefZ Feb 12 '22 at 19:50
  • You should normalize the text. One version uses "presentation forms", which exist for compatibility with old encodings. Nowadays you encode just the Arabic letter itself, and the font engine (or rather the shaping engine) selects the right form (isolated, initial, medial, final) and handles much more, such as good ligatures. Ideally, use a normalization that translates only this kind of compatibility character and does not apply other transformations (e.g. replacing subscripts, other symbols/unit letters, etc.). – Giacomo Catenazzi Feb 14 '22 at 08:26

1 Answer


I cannot tell you why there are different ways to encode this Arabic text snippet in Unicode, because I know almost nothing about the Arabic writing system. But since you tagged your question with python, I can give you some tooling to investigate further:

>>> import unicodedata as ud
>>> for l in 'مرحبا':
...     print(ud.name(l))
ARABIC LETTER MEEM
ARABIC LETTER REH
ARABIC LETTER HAH
ARABIC LETTER BEH
ARABIC LETTER ALEF

>>> for l in 'ﻣﺮﺣﺒﺎ':
...     print(ud.name(l))
ARABIC LETTER MEEM INITIAL FORM
ARABIC LETTER REH FINAL FORM
ARABIC LETTER HAH INITIAL FORM
ARABIC LETTER BEH MEDIAL FORM
ARABIC LETTER ALEF FINAL FORM

So, to me it looks like there are separate code points for typographic variants of the letters (similar to the separate Unicode code points for ligatures like "œ" and "ﬁ" in the Latin script). But there might be a different reason for the differences.
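
Following the normalization suggested in the comments, here is a minimal sketch (not something taken from the question) of how the presentation forms could be folded back to the plain letters. It assumes that NFKC compatibility normalization is acceptable for characters in the two Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF) and leaves every other character untouched. The helper names `fold_presentation_forms` and `percent_u` are made up for illustration, and `percent_u` only mimics the "%uXXXX" notation from the question for characters in the Basic Multilingual Plane.

```python
import unicodedata as ud


def is_arabic_presentation_form(ch):
    """True if ch lies in Arabic Presentation Forms-A or -B."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def fold_presentation_forms(text):
    """Apply NFKC only to Arabic presentation forms (hypothetical helper).

    Other compatibility characters (superscripts, unit symbols, ...)
    are passed through unchanged.
    """
    return "".join(
        ud.normalize("NFKC", ch) if is_arabic_presentation_form(ch) else ch
        for ch in text
    )


def percent_u(text):
    """Render text in the "%uXXXX" notation used in the question.

    Assumption: all characters are in the Basic Multilingual Plane.
    """
    return "".join(f"%u{ord(ch):04X}" for ch in text)


# The string returned by the translator, written with escapes for clarity.
shaped = "\uFEE3\uFEAE\uFEA3\uFE92\uFE8E"
plain = fold_presentation_forms(shaped)

print(percent_u(shaped))  # %uFEE3%uFEAE%uFEA3%uFE92%uFE8E
print(percent_u(plain))   # %u0645%u0631%u062D%u0628%u0627
```

Calling `ud.normalize("NFKC", text)` on the whole string would be shorter, but it would also rewrite unrelated compatibility characters such as superscripts, which is why the sketch restricts itself to the presentation-form ranges.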

lenz