Splitting Thai text by characters

Question

I want to split Thai text by characters, not by word boundaries (that part is solvable).

Example:

#!/usr/bin/env python3
text = 'เมื่อแรกเริ่ม'
for char in text:
    print(char)

This produces:
เ
ม

อ
แ
ร
ก
เ
ร

ม

Which obviously is not the desired output. Any ideas?

A portable representation of text is:

text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

While the "obviously wrong" nature of the output is apparent to you, it will not be to most of us. What makes it wrong? – Maurice Reeves, May 07 '15 at 14:28
Thai text is difficult for Latin-oriented users. Some characters with marks get split into several pieces (3), like 3 UTF-8 characters; the 3rd character in the text is an example – josifoski, May 07 '15 at 14:30
Take a look at this: http://stackoverflow.com/questions/13826331/how-to-split-a-thai-sentence-which-does-not-use-spaces-into-words – rafaelc, May 07 '15 at 14:31
I can't reproduce the desired output here, since Stack Overflow copy/paste does not represent those characters well (it behaves much like the Python split) – josifoski, May 07 '15 at 14:32
I would find it helpful if you could: 1) provide what you would like the desired output to be; 2) provide an ASCII string of Unicode character identifiers for your sample ( u'\u0e40' , etc) – tom10, May 07 '15 at 14:56

dawg · Accepted Answer · 2018-01-14T13:43:40.567

11

tl;dr: Use \X regular expression to extract user-perceived characters:

>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']

While I do not know Thai, I know a little French.

Consider the letter è. Let s and s2 equal è in the Python shell:

>>> s
'è'
>>> s2
'è'

Same letter? To a French speaker visually, oui. To a computer, no:

>>> s==s2
False

You can create the same letter either using the actual code point for è or by taking the letter e and adding a combining code point that adds that accent character. They have different encodings:

>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'

And different lengths:

>>> len(s)
1
>>> len(s2)
2

But visually both encodings result in the 'letter' è. This is called a grapheme, or what the end user considers one character.
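As suggested in the comments, spelling out the code points makes the difference explicit. This is a minimal sketch, assuming s is the precomposed U+00E8 and s2 is U+0065 followed by the combining grave accent U+0300; the names precomposed and decomposed are only illustrative:

```python
import unicodedata

# Assumption: these are the two spellings of "è" discussed above.
precomposed = "\u00e8"   # one code point: e with grave
decomposed = "e\u0300"   # two code points: e + combining grave accent

for label, value in (("precomposed", precomposed), ("decomposed", decomposed)):
    # List the official Unicode name of every code point in the string.
    names = [unicodedata.name(ch) for ch in value]
    print(label, len(value), names)
```

Both strings render as the same letter, but the printed names show that the second one is a base letter plus a separate combining mark.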

You can demonstrate the same looping behavior you are seeing:

>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']

Your string has several combining characters in it. Hence a Thai string that is 9 graphemes to your eyes becomes a 13-character string to Python.

The solution in French is to normalize the string based on Unicode equivalence:

>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True

That does not work for many non-Latin languages though. An easy way to deal with Unicode strings in which multiple code points may compose a single grapheme is to use a regex engine that handles this correctly by supporting \X. Unfortunately, Python's included re module does not support it yet.

The proposed replacement, regex, does support \X though:

>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
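For comparison, here is a small standard-library check of why normalization alone does not help with the Thai sample. It assumes the string from the question, written with escapes. As far as I know, Unicode has no precomposed forms for these Thai consonant-plus-mark sequences, so NFC has nothing to merge:

```python
from unicodedata import normalize

# The sample text from the question, written as explicit code points.
text = ("\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01"
        "\u0e40\u0e23\u0e34\u0e48\u0e21")

composed = normalize("NFC", text)

print(len(text))         # code points before normalization
print(len(composed))     # code points after NFC
print(composed == text)  # whether NFC changed anything
```

The two lengths come out the same, which matches the comment below that normalize did not work for the Thai characters.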

edited Jan 14 '18 at 13:43

answered May 07 '15 at 15:41

Thanks for your effort (the upvote is from me). There might be something in this direction; however, UTF-8 and Thai are not best friends @dawg – josifoski May 07 '15 at 15:59
I had also looked at normalize, and it did not work for the Thai characters. But `regex` seems to be a really nice tool :-) – Serge Ballesta May 07 '15 at 17:42
Cool solution with the newest regex @dawg. I can't make two accepted answers – josifoski May 07 '15 at 18:19
Using regex with `\X` is more robust. Serge Ballesta's solution is only combining the characters for console output -- not in a logical fashion. – dawg May 07 '15 at 18:30
Hmm @dawg, does the pattern r'\X' match single characters in all languages (not only Thai)? If yes, then the solution is robust! – josifoski May 07 '15 at 18:36
Yes -- any language. It takes a regular letter and combines it with all following combination marks. – dawg May 07 '15 at 18:45
You win @dawg. Congratulations :) – josifoski May 07 '15 at 18:51
`s2` is not a grapheme: it is a [grapheme *cluster*](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) the so-called "user-perceived character". For clarity, you could use explicit Unicode code points numbers such as `è` ([U+00e8](http://codepoints.net/U+00e8)) or `è` ([U+0065](http://codepoints.net/U+0065) [U+0300](http://codepoints.net/U+0300)). – jfs May 11 '15 at 12:17
note: `\X` regex handles eXtended grapheme clusters such as `กำ` (U+0E01 U+0E33). It doesn't work for Tailored grapheme clusters such as Slovak `ch` digraph (U+0063 U+0068). – jfs May 11 '15 at 12:22
I've added summary. Feel free to rollback. – jfs May 11 '15 at 12:27

Serge Ballesta · Answer 2 · 2015-05-07T15:45:22.700

I cannot reproduce this exactly, but here is a slightly modified version of your script, with the output from IDLE 3.4 on a 64-bit Windows 7 system:

>>> import unicodedata
>>> for char in text:
    print(char, hex(ord(char)), unicodedata.name(char),'-',
          unicodedata.category(char), '-', unicodedata.combining(char), '-',
          unicodedata.east_asian_width(char))


เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
ื 0xe37 THAI CHARACTER SARA UEE - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
อ 0xe2d THAI CHARACTER O ANG - Lo - 0 - N
แ 0xe41 THAI CHARACTER SARA AE - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ก 0xe01 THAI CHARACTER KO KAI - Lo - 0 - N
เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ิ 0xe34 THAI CHARACTER SARA I - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
>>>

I really do not know what those characters can be - my Thai is very poor :-) - but it shows that :

text is acknowledged to be Thai ...
output is coherent with len(text) (13)
category and combining are different when characters are combined

If that is the expected output, it shows that your problem is not in Python but in the console where you display it. You should try redirecting the output to a file and then opening the file in a Unicode editor that supports Thai characters.

If the expected output is only 9 characters, that is, if you do not want to split composed characters apart, and provided there are no other composition rules to consider, you could use something like:

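import unicodedata

def Thaidump(t):
    old = None  # pending base character plus the marks attached so far
    for i in t:
        if unicodedata.category(i) == 'Mn':
            # Nonspacing mark: attach it to the pending base character
            # (a leading mark with no base starts its own group).
            old = i if old is None else old + i
        else:
            # New base character: print the previous group first.
            if old is not None:
                print(old)
            old = i
    if old is not None:
        print(old)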

That way :

>>> Thaidump(text)
เ
มื่
อ
แ
ร
ก
เ
ริ่
ม
>>>
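If you need the pieces as a list rather than printed output, the same idea can be sketched as a function that returns them. This is only an approximation under the same assumption as above (no other composition rules apply): it attaches every character in a mark category (Mn, Mc, Me) to the preceding character and is not a full implementation of Unicode grapheme cluster rules. The name split_clusters is hypothetical:

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me all start with "M" (combining marks).
        if unicodedata.category(ch).startswith("M") and clusters:
            clusters[-1] += ch      # attach the mark to the previous group
        else:
            clusters.append(ch)     # start a new group
    return clusters

# The sample text from the question, written as explicit code points.
text = ("\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01"
        "\u0e40\u0e23\u0e34\u0e48\u0e21")
parts = split_clusters(text)
print(len(parts))
print(parts)
```

Because it looks only at the general category, it will split sequences whose second character is not classified as a mark, so a grapheme-aware tool may still group some text differently.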

Thanks @serge-ballesta, I'm reading your answer carefully. The problem is that len(text) should be 9, not 13. It seems the UTF-8 strategy had better be changed. Reading – josifoski, May 07 '15 at 15:12
@josifoski : this goes beyond my Thai knowledge, that's why I added `unicodedata.category` and `combining`. By mixing all that, it is possible to display only 9 characters by combining decomposed characters, **provided there are no other special rules to consider** – Serge Ballesta, May 07 '15 at 15:34
Let me check your newest function in python3, @serge-ballesta – josifoski, May 07 '15 at 15:47
Also worth mentioning: here on Stack Overflow the split characters are not rendered well, but they look fine when the Python script runs in a terminal. I'll have to check whether they are single UTF-8 characters or whether the problematic ones have len > 1 – josifoski, May 07 '15 at 15:55

score 2 · Answer 3 · answered Jul 20 '17 at 13:34

For clarification of the previous answers, the issue you have is that the missing characters are "combining characters" - vowels and diacritics that must be combined with other characters in order to be displayed properly. There is no standard way to display these characters by themselves, although the most common convention is to use a dotted circle as a null consonant as shown in the answer by Serge Ballesta.

The question is then: for your application, is each vowel and diacritic considered a separate character, or do you wish to separate by "print cell" as shown in Serge's answer?

By the way, in normal usage the lead vowels SARA E and SARA AE should not be displayed without a following consonant except in the process of typing a longer word.

For more information, see the WTT 2.0 standard published by the Thai API Consortium (TAPIC) which defines how characters can be combined, displayed and how to cope with errors.
