
I used Python with the `regex` module:

import regex

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')

The results show that ल्ली is split into two Hindi characters:

ल b'\\u0932'
् b'\\u094d'
--------
ल b'\\u0932'
ी b'\\u0940'
--------

This is wrong: ल्ली is actually a single Hindi character. How can I get each Hindi character (such as ल्ली), together with the number of Unicode code points it is composed of?

In short, I want to split 'कृपयाल्ली' to 'कृ','प','या','ल्ली'

Ismael Padilla
GGS of Future
  • You misunderstand. In short, I want to split 'कृपयाल्ली' to 'कृ','प','या','ल्ली'. – GGS of Future Jul 31 '20 at 07:37
  • You can recycle the same answer. Just keep together chars that are `combining`, and then put a `ZWNJ` between the characters. You may adapt it depending on how you want to handle virama. See the Indic languages chapter of the Unicode Standard for more information – Giacomo Catenazzi Jul 31 '20 at 08:15

2 Answers


I'm not quite sure if this is correct, being Finnish and not well versed in Hindi, but this would merge characters with any subsequent Unicode Mark characters:

import unicodedata


def merge_compose(s: str):
    current = []
    for c in s:
        if current and not unicodedata.category(c).startswith("M"):
            yield current
            current = []
        current.append(c)
    if current:
        yield current


for group in merge_compose("कृपयाल्ली"):
    print(group, len(group), "->", "".join(group))

The output is

['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
AKX
  • pretty sure this is it, I was trying to decipher the results of `unicodedata.category` but since I always work in english, I had little idea – juanpa.arrivillaga Jul 31 '20 at 07:53
  • The issue is also muddied by the fact that if you copy-paste some of OP's strings through e.g. the macOS console, they get modified... – AKX Jul 31 '20 at 08:03
  • Thank you, unicodedata.category() is the right method, but that's not how it should be used. – GGS of Future Aug 04 '20 at 11:15

I found the answer in another question.

import unicodedata


def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        # Marks always join the cluster; a letter joins only right after a virama.
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

#print(list(splitclusters(word)))
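To tie this back to the question (which clusters, and how many code points each one holds), here is a minimal self-contained sketch. It repeats the same virama rule as the function above. The names `split_clusters` and `cluster_lengths` are hypothetical helpers introduced only for illustration, and the approach is still an approximation, not the full Unicode text segmentation algorithm.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"


def split_clusters(text):
    """Yield approximate Devanagari clusters (same rule as splitclusters)."""
    cluster = ""
    previous = None
    for char in text:
        major = unicodedata.category(char)[0]
        # A mark always joins the current cluster; a letter joins only
        # when it directly follows a virama (forming a conjunct).
        if major == "M" or (major == "L" and previous == VIRAMA):
            cluster += char
        else:
            if cluster:
                yield cluster
            cluster = char
        previous = char
    if cluster:
        yield cluster


def cluster_lengths(text):
    """Pair each cluster with the number of code points it contains."""
    return [(cluster, len(cluster)) for cluster in split_clusters(text)]


for cluster, count in cluster_lengths("कृपयाल्ली"):
    # Show the cluster, its code point count, and the code points themselves.
    print(cluster, count, [f"U+{ord(ch):04X}" for ch in cluster])
```

With the string from the question, this should give four clusters, with ल्ली kept together as a single cluster of four code points.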
GGS of Future
  • If you find an answer elsewhere, please also add a link to where you found it. This will be very useful in a few years: most answers will be obsolete, and only a few of them will have an updated version, e.g. for new Python versions or a new Unicode data package. Remember: we are a reference site, so we should write questions and answers for other people too (who will look at this in the future) – Giacomo Catenazzi Jul 31 '20 at 08:56
  • @GiacomoCatenazzi I think [this](https://stackoverflow.com/a/6806203/5320906) is the answer that the OP refers to – snakecharmerb Jul 31 '20 at 15:39
  • [link](https://stackoverflow.com/questions/6805311/combining-devanagari-characters/6805416#6805416) – GGS of Future Aug 04 '20 at 11:11