I have an Urdu word, "لاعلم", and other similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".

English words can be easily separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.

  • You don't have spaces at all in Urdu? Because in Arabic they use similar letters, but they **will** write a white space between "لا" and "علم", if they were two separate words... – Itay Dec 05 '15 at 10:09
  • How did you solve this? – Adil Soomro Mar 29 '20 at 20:58

1 Answer

There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" written separately wouldn't mean anything.

A space is inserted in Urdu (and in Arabic script generally) out of a practical need to demarcate words where the font would otherwise automatically join a letter to the adjoining characters. The only way most people know to break the join is to insert a superfluous space between the characters. Technically, the ZERO WIDTH NON-JOINER (U+200C) exists precisely for this purpose, but human beings are slow to learn and a space is easy to insert.

Some characters don't join with the letter that follows them. For example, "ا" never joins with a following character, but it can join with a preceding character such as "ل" to form the ligature "لا". You can use this list of characters (the same rules apply to Arabic) and write a custom tokenizer that ends a word after a "Right Joining" character, a ZWNJ, or a space.
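
The tokenizer described above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: the `RIGHT_JOINING` set is a hand-picked, incomplete subset of the letters with Unicode joining type "Right Joining" (Python's standard library does not expose joining types, so they are hard-coded here), and the function name `split_at_join_breaks` is made up for this example. It also ignores diacritics, which would be detached from a right-joining letter they follow.

```python
# Partial, hand-written set of "Right Joining" letters (an assumption for
# this sketch; consult the Unicode joining-type data for the full list).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627"  # alef forms and waw with hamza
    "\u0629\u062F\u0630\u0631\u0632"  # teh marbuta, dal, thal, reh, zain
    "\u0648"                          # waw
    "\u0688\u0691\u0698"              # Urdu ddal, rreh, jeh
    "\u06D2\u06D3"                    # yeh barree forms
)
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def split_at_join_breaks(text):
    """Split text wherever the script stops joining.

    A segment ends after a right-joining letter, at a ZWNJ, or at whitespace.
    """
    segments = []
    current = ""
    for ch in text:
        if ch == ZWNJ or ch.isspace():
            # Explicit separators end the segment and are dropped.
            if current:
                segments.append(current)
            current = ""
            continue
        current += ch
        if ch in RIGHT_JOINING:
            # This letter cannot connect to what follows, so close the segment.
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


print(split_at_join_breaks("لاعلم"))
```

For "لاعلم" this yields the two pieces "لا" and "علم". Note that the split follows visual joining breaks, not word or morpheme boundaries: a word such as "اردو", made up entirely of right-joining letters, would be cut into four single-letter pieces, so the result is at best a starting point for a stemmer.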

Battlefury
  • Also you can use `&#8203;` to insert a `Zero Width Space` between لا and علم – Reza Aghaei Dec 05 '15 at 10:43
  • @RezaAghaei You should not! It's a single word. – Battlefury Dec 05 '15 at 11:15
  • Thank you all for your replies. Actually, I am working on an Urdu stemming application that will extract the prefix, stem and postfix from the input word. So I guess there is no other way to detect the invisible separator between Urdu words. – user3699181 Dec 05 '15 at 11:23