My website supports a number of Indian languages, and the user can change the language dynamically. When a user enters a string, I have to split it into its individual characters, so I'm looking for a common function that works for English and a select set of Indian languages. I have searched across sites, but there appears to be no common way to handle this requirement. There are language-specific implementations (for example, the Open-Tamil package for Tamil implements get_letters), but I could not find a common way to split or iterate through the characters in a Unicode string that takes graphemes into account.

One of the many methods that I've tried:

name = u'தமிழ்'
print name
for i in list(name):
  print i

#expected output
தமிழ்
த
மி
ழ்

#actual output
தமிழ்
த
ம
ி
ழ
்

# Here is another example, using a different Indian language
name = u'हिंदी'
print name
for i in list(name):
  print i

#expected output
हिंदी
हिं
दी

#actual output
हिंदी
ह
ि  
ं 
द
ी
user1928896

3 Answers

To get "user-perceived" characters whatever the language, use the \X (eXtended grapheme cluster) regular expression:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex # $ pip install regex

for text in [u'தமிழ்', u'हिंदी']:
    print("\n".join(regex.findall(r'\X', text, regex.U)))

Output

த
மி
ழ்
हिं
दी
jfs

The way to solve this is to group all "L" category characters with their subsequent "M" category characters:

>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
...   print c
... 
த
மி
ழ்

regex

Ignacio Vazquez-Abrams
  • Hi, did you mean 'regex' or 're'? I tried 're.findall(ur'\p{L}\p{M}*', name)' and it returned an empty list. – user1928896 Oct 11 '15 at 19:29
  • I meant "regex". Which is why I wrote "regex". And included a link to `regex`. – Ignacio Vazquez-Abrams Oct 11 '15 at 23:02
  • As it turns out, I cannot use the `regex` module in my App Engine application, since `regex` is not pure Python but includes a C extension. Is there an alternative solution to this problem using Python's `re` module, or some other means of achieving this? – user1928896 Oct 13 '15 at 01:07
  • You'll have to use `unicodedata.category()` to get the category of each character in turn and group them accordingly. – Ignacio Vazquez-Abrams Oct 13 '15 at 01:38
  • While this may work in this particular case, `\X` is the preferred mechanism for pulling out individual grapheme clusters. – tchrist Oct 13 '15 at 13:07
  • This solution is not correct. It won't work for combined emojis, for example country flags. – kxmh42 Jul 14 '19 at 18:25
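
The `unicodedata.category()` suggestion in the comments above could be sketched roughly as follows, using only the standard library. The function name `split_clusters` is hypothetical, and the rule it applies (attach every category "M" character to the character before it) is a simplification rather than the full Unicode grapheme cluster algorithm that `\X` implements. As noted in the comments, it will not handle emoji sequences such as country flags.

```python
# -*- coding: utf-8 -*-
import unicodedata


def split_clusters(text):
    """Split text into base characters plus their trailing combining marks.

    Hypothetical helper: a character whose Unicode category starts with
    'M' (Mn, Mc, Me) is appended to the previous cluster; any other
    character starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith('M'):
            # Combining mark: attach it to the preceding base character.
            clusters[-1] += ch
        else:
            # Letter, digit, space, punctuation, ...: start a new cluster.
            clusters.append(ch)
    return clusters


for word in [u'தமிழ்', u'हिंदी']:
    print(u"\n".join(split_clusters(word)))
```

Unlike the `\p{L}\p{M}*` pattern, this sketch keeps non-letter characters (digits, spaces, punctuation) as clusters of their own instead of dropping them.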

`uniseg` works really well for this, and the docs are OK. The other answer to this question works for international Unicode characters but falls flat if users enter emoji. The solution below will work:

>>> emoji = u''
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for c in list(grapheme_clusters(emoji)):
...     print c
...




This was run with uniseg 0.7.1 (`pip install uniseg==0.7.1`).

Aidan Fitzpatrick