
I want to perform Unicode text normalization for Bengali. For example, the sentences প্রায়শ্চিত্ত - মনীন্দ্র ও তার পড়াশুনা and প্রায়শ্চিত্ত - মণীন্দ্র ও তার পড়াশুনা differ in their Unicode code points as listed below (note that the word মনীন্দ্র is written with ন in the first sentence and with ণ in the second):

SENTENCE 1: প্রায়শ্চিত্ত - মনীন্দ্র ও তার পড়াশুনা

[('প', 2474), ('্', 2509), ('র', 2480), ('া', 2494), ('য়', 2527), ('শ', 2486), ('্', 2509), ('চ', 2458), ('ি', 2495), ('ত', 2468), ('্', 2509), ('ত', 2468), (' ', 32), ('-', 45), (' ', 32), ('ম', 2478), ('ন', 2472), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480), (' ', 32), ('ও', 2451), (' ', 32), ('ত', 2468), ('া', 2494), ('র', 2480), (' ', 32), ('প', 2474), ('ড়', 2524), ('া', 2494), ('শ', 2486), ('ু', 2497), ('ন', 2472), ('া', 2494)]

SENTENCE 2: প্রায়শ্চিত্ত - মণীন্দ্র ও তার পড়াশুনা

[('প', 2474), ('্', 2509), ('র', 2480), ('া', 2494), ('য়', 2527), ('শ', 2486), ('্', 2509), ('চ', 2458), ('ি', 2495), ('ত', 2468), ('্', 2509), ('ত', 2468), (' ', 32), ('-', 45), (' ', 32), ('ম', 2478), ('ণ', 2467), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480), (' ', 32), ('ও', 2451), (' ', 32), ('ত', 2468), ('া', 2494), ('র', 2480), (' ', 32), ('প', 2474), ('ড়', 2524), ('া', 2494), ('শ', 2486), ('ু', 2497), ('ন', 2472), ('া', 2494)]
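For reference, listings like the two above can be produced with `ord`, and the differing position can be located programmatically. This is only a minimal sketch: the helper names `code_points` and `first_difference` are made up for illustration, and only the word মনীন্দ্র / মণীন্দ্র is compared, written with escapes so the code points are unambiguous.

```python
def code_points(text):
    """Return (character, decimal code point) pairs for every character in text."""
    return [(ch, ord(ch)) for ch in text]


def first_difference(a, b):
    """Return (index, char_a, char_b) for the first differing position, or None.

    Only positions up to the length of the shorter string are compared.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


# The word from sentence 1 (with U+09A8) and from sentence 2 (with U+09A3).
word_1 = "\u09ae\u09a8\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"
word_2 = "\u09ae\u09a3\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"

print(code_points(word_1))
print(first_difference(word_1, word_2))
```

The second `print` reports index 1, which is where 2472 and 2467 appear in the two listings.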

I found the library https://github.com/csebuetnlp/normalizer, but normalizing the input text with it does not change the Unicode code points. With https://github.com/anoopkunchukuttan/indic_nlp_library, normalization only happens for a limited set of characters, such as the poorna viram ('|', the full stop). Any suggestions for performing this normalization would be helpful.
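A quick check that is independent of either library is whether the standard Unicode normalization forms treat the two letters as equivalent. The sketch below uses only Python's built-in `unicodedata` module; the helper name `forms_agree` is hypothetical. If none of the forms unify the two letters, that may explain why a normalizer built on standard Unicode rules leaves them unchanged.

```python
import unicodedata

NA = "\u09a8"   # code point 2472, the letter in sentence 1
NNA = "\u09a3"  # code point 2467, the letter in sentence 2


def forms_agree(a, b):
    """For each standard normalization form, report whether a and b become equal."""
    return {
        form: unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


print(forms_agree(NA, NNA))
print(unicodedata.name(NA), "/", unicodedata.name(NNA))

# By contrast, U+09DF (2527 in the listings) does have a canonical decomposition,
# so standard normalization does rewrite it, as the pair U+09AF U+09BC.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", "\u09df")])
```

The two letters have separate names in the Unicode character database, and no form maps one to the other, so any merging would have to come from rules outside standard normalization.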

Detailed Explanation:

The issue is that the Unicode code points for what I consider the same character are not consistent. Suppose I search for the string "apple", where 'a' has code point 200, and two of the n strings in the system are candidates: String 1 contains "apple" with 'a' at code point 200, and String 2 contains "apple" with 'a' at code point 300. I want both String 1 and String 2 to show up. Currently, only String 1 shows up because it is the only exact match for the query string.
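The desired behaviour can be sketched as a search that compares folded keys instead of raw strings, while still returning the original strings. Everything here is an assumption for illustration: the names `FOLD_TABLE`, `fold`, and `search` are made up, and the table contains only the single pair from the example (code point 2467 folded to 2472).

```python
# Hypothetical equivalence table: fold U+09A3 (2467) to U+09A8 (2472).
FOLD_TABLE = str.maketrans({"\u09a3": "\u09a8"})


def fold(text):
    """Return the search key for text, with equivalent letters merged."""
    return text.translate(FOLD_TABLE)


def search(query, candidates):
    """Return the original candidates whose folded form contains the folded query."""
    key = fold(query)
    return [c for c in candidates if key in fold(c)]


string_1 = "\u09ae\u09a8\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with 2472
string_2 = "\u09ae\u09a3\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with 2467
unrelated = "\u09a4\u09be\u09b0"

# Both spellings are returned, whichever one is used as the query.
print(search(string_1, [string_1, string_2, unrelated]))
```

Folding both the query and the candidates keeps the stored text untouched, so the original spelling is still what gets displayed.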

I consider ন and ণ to be the same character, but they are treated differently because their code points differ. For this particular case, I can replace ণ with ন, so that a string search returns words containing either 'ন' or 'ণ'. However, other letters may have the same ambiguity, or ন may be written in yet another way with a code point other than 2472 or 2467. I would like to know a principled approach to handling this scenario.
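One possible shape for a more systematic version, offered only as a sketch: apply standard NFC normalization first, then an explicit, application-defined table of equivalence classes, and audit the corpus by character name to find further candidates for that table. The names `EQUIVALENCE_CLASSES`, `build_fold_table`, `canonical_key`, and `character_inventory` are hypothetical, and the single class listed is the one pair named above; any further classes would have to come from knowledge of the orthography.

```python
import unicodedata
from collections import Counter

# Hypothetical, application-defined equivalence classes.
# The first member of each group is the chosen representative.
EQUIVALENCE_CLASSES = [
    ("\u09a8", "\u09a3"),  # 2472 is kept, 2467 is folded into it
]


def build_fold_table(classes):
    """Map every non-representative member of a class to its representative."""
    table = {}
    for representative, *variants in classes:
        for variant in variants:
            table[ord(variant)] = representative
    return table


FOLD_TABLE = build_fold_table(EQUIVALENCE_CLASSES)


def canonical_key(text):
    """Apply NFC (standard Unicode equivalences), then the custom fold table."""
    return unicodedata.normalize("NFC", text).translate(FOLD_TABLE)


def character_inventory(texts):
    """Count each distinct character by code point and Unicode name, to spot variants."""
    counts = Counter()
    for text in texts:
        for ch in unicodedata.normalize("NFC", text):
            counts[f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"] += 1
    return counts


print(canonical_key("\u09ae\u09a3\u09c0") == canonical_key("\u09ae\u09a8\u09c0"))
print(character_inventory(["\u09ae\u09a3\u09c0", "\u09ae\u09a8\u09c0"]))
```

Keeping the classes in one data structure means each addition is a one-line, reviewable change rather than another ad hoc replacement.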

P.S. It would also be really helpful if you could point me to any Bengali-specific resource for the canonical representations.

learner
  • From what I can tell, the highlighted characters are simply distinct with regard to Unicode and there are no normalization rules involving them. This means that you will have to look for some language-specific solution outside of Unicode standards. – nwellnhof Jan 12 '22 at 12:35
  • If you can direct me to a Bengali language-specific resource it will be really helpful. – learner Jan 12 '22 at 15:10

0 Answers