FIRST, I've used the Python 3 grapheme library to solve my problem. (For a bit more about grapheme, see this article). But I'm surprised that Python 3 couldn't do this without a specialized library...
I resorted to grapheme because after many web searches and reading of StackOverflow questions, I couldn't get Python 3 to return the correct number of character positions in a sequence of Thai characters.
For example, here's a UTF-8 string of Thai characters:
thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'
I use the term character position to identify a single position in a line/string of Thai characters. That's because a character position may consist of a Thai consonant plus, in some cases, a vowel or tone marker above or below that consonant. The consonant plus the mark above/below occupies a single character position when displayed, even though each is a separate code point in the Unicode string. (Some Thai consonants may also have vowels to their left, right, or both. Those vowels occupy their own character position.)
For example, in the following sequence generated from the example string, items 2 and 7 are vowels written above a consonant, and item 10 is a diacritic written above a consonant (mai taikhu, strictly a vowel-shortening mark rather than a tone marker). Each is a separate code point, and so consumes its own bytes in the UTF-8 string, but doesn't occupy its own character position. Items 3 and 8 are vowels that go to the left of a consonant and so do occupy character positions.
01: ส
02: ี
03: โ
04: ช
05: ค
06: ด
07: ี
08: เ
09: ป
10: ็
...
45: ว
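A listing like the one above can be produced by iterating over the string one code point at a time. The sketch below is only an illustration: the helper name describe is hypothetical, and the extra columns from the standard unicodedata module (general category and character name) are an addition meant to show which items are combining marks (category Mn) and which are ordinary letters (category Lo).

```python
import unicodedata

thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'


def describe(text):
    """Return (position, character, category, name) for each code point."""
    return [
        (i, ch, unicodedata.category(ch), unicodedata.name(ch, '?'))
        for i, ch in enumerate(text, start=1)
    ]


# Print the first ten items, numbered from 01 as in the listing above.
for i, ch, cat, name in describe(thai_str)[:10]:
    print(f'{i:02d}: {ch}  {cat}  {name}')
```

Items whose category is Mn are the marks that sit above or below a consonant; the left-side vowels in items 3 and 8 are reported as Lo, like the consonants.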
When trying to determine the character positions in the example string, len(thai_str) returns 45, which isn't correct. The only way I've been able to get the correct number of character positions is to use grapheme.length(thai_str), which returns 35.
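The difference between 45 and 35 is the ten marks that sit above or below a consonant. As a rough illustration using only the standard library, the sketch below counts code points whose Unicode general category is not a nonspacing or enclosing mark (Mn or Me). This is an assumption-laden approximation, not what grapheme does internally and not full Unicode grapheme segmentation: it happens to agree with grapheme.length for this string, but it may well disagree for other scripts, emoji sequences, or Thai text containing characters that the segmentation rules treat specially. The function name count_non_marks is hypothetical.

```python
import unicodedata

thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'


def count_non_marks(text):
    """Count code points that are not nonspacing/enclosing marks (Mn, Me).

    Approximation only: this is not a grapheme-cluster segmenter.
    """
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ('Mn', 'Me')
    )


print(len(thai_str))              # number of code points
print(count_non_marks(thai_str))  # code points minus combining marks
```

Note that unicodedata.combining() is not a reliable test here, because several Thai above-vowels have a canonical combining class of 0 even though their category is Mn.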
I've also used encode to get the following:
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x82\xe0\xb8\x8a\xe0\xb8\x84\xe0\xb8\x94...
(Counting the instances of \xe0 that seem to precede every Thai character doesn't feel like the correct approach...)
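To make the byte-counting idea concrete, here is a small sketch of what it would actually measure. Every character in the Thai block encodes to three UTF-8 bytes beginning with \xe0, so counting that byte counts Thai code points (marks included), and counting all non-continuation bytes simply reproduces len(thai_str). Neither gives the number of character positions. The variable names are illustrative only.

```python
thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'

encoded = thai_str.encode('utf-8')

# Each Thai code point is 3 bytes; the single ASCII space is 1 byte.
total_bytes = len(encoded)

# Lead byte 0xE0 starts every Thai code point, combining marks included.
lead_e0 = encoded.count(b'\xe0')

# UTF-8 continuation bytes look like 0b10xxxxxx; every other byte starts
# a new code point, so this count always equals len(thai_str).
code_point_starts = sum(1 for b in encoded if (b & 0xC0) != 0x80)

print(total_bytes, lead_e0, code_point_starts)
```

Here lead_e0 comes out one less than len(thai_str) only because the space is not a Thai character, so this route cannot get from 45 to 35.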
SO -- is the only way to count character positions in my example string to use a Python 3 library such as grapheme?