
I want to split an Arabic/Persian word into its individual letters without changing how each letter looks, i.e. each separated letter should keep the contextual form (initial/medial/final/isolated) it had inside the word instead of falling back to its isolated form.

Here's an example:

Regular segmentation:

بابا ====>  ب ا ب ا

شاهین ====> ش ا ه ی ن

Desired segmentation:

بابا ====> بـ ـابـ ـا

شاهین ====> شـ ـاهـ یـ ـن
Shaheen Zahedi
  • 1
    Half-done attempt at https://tio.run/##jVJNTwIxEL33V7zsqRVcVzR6WD0gZxMTj8Bh7RYslHazLSgx3v0nevPj1@CvwakCUYOJaV7avs6beZnpqJgVu65SdlSOl8tqemW0hDSF9zgvtMUdW3E@FIG2mdMlJvTCL0Ot7RDdflEPvQC7YyvG4xTJ4mnxGJHk7HLug5qkbhrSigKCsdyLnPm0VpUppGobw5Ne7yxpIulNW1lWJuJPFaNjtw9ZURGrbhCvPpWuVBeOzh03tYFnTfjUKDsM11yIncN@zgauBqcAaBJmOW0n2K7DRkhRjYagFuzt/XbDv2nb8eYGA6/C2byzZn3MpgVlYbLqajJBhf@tyqkmvL4lMyGo2oO/P6C5eCO8El4Iz4RHAVlYOGvmuFIw2o5VieAQrpWuUdVKqjLO5CtNurLS2I9mstvY67W9RmvN5UwPwDemI3nUOo59WEUebNQyZ/dQxqttj5T6ntaWQcbBfX0WKtPEj3FhB4fYxYEQ@NQvlx8 – ninjalj Jul 20 '20 at 15:29
  • 1
    The basic idea is to use ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER on the appropriate side of each character. There are six characters that don't connect to the next character, so they need special handling. Also, I don't know what to do with the LAM+ALIF ligature. – ninjalj Jul 20 '20 at 15:32
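
The ZERO WIDTH JOINER idea from the comment above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not a complete shaping implementation: the class name ArabicSplitter and its helper methods are hypothetical, the list of letters that do not connect to the following letter is a hand-picked subset covering common Arabic and Persian letters (a fuller version would likely take joining types from Unicode's ArabicShaping.txt), and the input is assumed to contain no diacritics or tatweel. The LAM+ALIF ligature is not treated specially.

```java
import java.util.ArrayList;
import java.util.List;

final class ArabicSplitter {
    private static final char ZWJ = '\u200D'; // ZERO WIDTH JOINER

    // Assumption: hand-picked letters that join only to the previous letter
    // (alef variants, waw variants, teh marbuta, dal, thal, reh, zain, jeh).
    private static final String RIGHT_JOINING =
        "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648\u0671\u0698";
    private static final char HAMZA = '\u0621'; // joins to nothing

    static boolean isArabicLetter(char c) {
        return Character.isLetter(c)
            && Character.UnicodeScript.of(c) == Character.UnicodeScript.ARABIC;
    }

    // True if the letter can connect to the letter before it.
    static boolean joinsToPrevious(char c) {
        return isArabicLetter(c) && c != HAMZA;
    }

    // True if the letter can connect to the letter after it.
    static boolean joinsToNext(char c) {
        return joinsToPrevious(c) && RIGHT_JOINING.indexOf(c) < 0;
    }

    // Splits a word into one string per letter, adding a ZWJ on each side
    // where the letter was connected to its neighbour inside the word.
    static List<String> split(String word) {
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean linkBefore = i > 0
                && joinsToNext(word.charAt(i - 1)) && joinsToPrevious(c);
            boolean linkAfter = i + 1 < word.length()
                && joinsToNext(c) && joinsToPrevious(word.charAt(i + 1));
            StringBuilder piece = new StringBuilder();
            if (linkBefore) piece.append(ZWJ);
            piece.append(c);
            if (linkAfter) piece.append(ZWJ);
            pieces.add(piece.toString());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // The two words from the question, written as escapes.
        System.out.println(String.join(" ", split("\u0628\u0627\u0628\u0627")));
        System.out.println(String.join(" ", split("\u0634\u0627\u0647\u06CC\u0646")));
    }
}
```

Because the joiners are invisible, whether the separated letters actually render in their contextual forms depends on the font and text renderer. Replacing ZWJ with tatweel (U+0640) in this sketch would instead produce visible connecting strokes like the ones in the question's examples.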

1 Answer


You could use java.text.Normalizer to achieve this. Take a look here for more info.

Something like:

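 // NFKD applies compatibility decomposition; \p{M} then strips the combining marks (diacritics).
 String segmented = java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFKD).replaceAll("\\p{M}", "");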
aran
  • The link here optimizes Arabic input queries for search purposes, e.g. it removes 'HAMZA'. But I want to separate each character of a word; the two are very different. – Shaheen Zahedi Jul 11 '20 at 16:33