I am using polyglot to tokenize text in the Burmese language. Here is what I am doing:

    from polyglot.text import Text

    blob = u"""
    ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
    """
    text = Text(blob)

When I run:

    print(text.words)

It outputs in the following format:

    [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']

What kind of output is this, and why does it look like this? How can I convert it back to a form I can actually read?

I also tried the following:

    text.words[1].decode('unicode-escape')

but it raises an error: `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)`

Mohd Shahid
  • Possible duplicate of [Python print unicode strings in arrays as characters, not code points](https://stackoverflow.com/questions/5648573/python-print-unicode-strings-in-arrays-as-characters-not-code-points) – Ken Y-N Oct 23 '18 at 08:40
  • @KenY-N I had tried this. But it throws an error: `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)` – Mohd Shahid Oct 23 '18 at 08:41
  • Maybe [this will help](https://stackoverflow.com/questions/25103575/unicodeencodeerror-ascii-codec-cant-encode-characters-in-position-0-3-ordin)? Upgrading to Python 3 might be for the best... – Ken Y-N Oct 23 '18 at 08:42
  • When you print `blob` does it print correctly? If so, what happens when you print the strings in the `text.words` list one by one? – PM 2Ring Oct 23 '18 at 09:08

2 Answers

That is how Python 2 prints a list. It is debugging output (see `repr()`) that unambiguously shows the content of the list: `u''` indicates a Unicode string and `\uxxxx` indicates the Unicode code point U+xxxx. The output is all ASCII, so it works on any terminal. If you print the strings in the list directly, they display correctly, provided your terminal supports the characters being printed. Example:

    words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
    print words
    for word in words:
        print word

Output:

    [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
    ထို
    င္းေ
    ရာ
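
If you would rather see all the words on one line than loop over them, one option is to join them into a single string first. This is a minimal sketch, not part of polyglot: `format_words` and its `sep` parameter are hypothetical names made up for illustration. It relies only on the point above, that a string printed directly is shown as characters while a list is shown through `repr()`.

```python
# -*- coding: utf-8 -*-
# Sample tokens copied from the question's output.
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']


def format_words(words, sep=u' | '):
    """Join Unicode strings into one Unicode string (hypothetical helper).

    The result is a plain string, so printing it shows the characters
    themselves instead of the escaped repr() form used for lists.
    """
    return sep.join(words)


line = format_words(words)
```

Calling `print(format_words(words))` should then show the Burmese text on one line, subject to the same terminal and font requirements as printing each word separately.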

To reemphasize: your terminal must be configured with an encoding that supports these Unicode code points (ideally UTF-8) and must use a font that contains the characters. Otherwise, you can write the text to a file in UTF-8 encoding and view the file in an editor that supports UTF-8 and has a suitable font:

    import io
    with io.open('example.txt','w',encoding='utf8') as f:
        for word in words:
            f.write(word + u'\n')
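
To check that nothing is lost on the way to disk, you can read the file back and compare it with the original list. The sketch below assumes a temporary directory rather than the working directory, and `write_words` and `read_words` are hypothetical helper names used only here.

```python
import io
import os
import tempfile

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']


def write_words(path, words):
    """Write one word per line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf8') as f:
        for word in words:
            f.write(word + u'\n')


def read_words(path):
    """Read the file back as Unicode strings, dropping the line endings."""
    with io.open(path, 'r', encoding='utf8') as f:
        return [line.rstrip(u'\n') for line in f]


# Use a throwaway directory so the example does not touch existing files.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
write_words(path, words)
round_tripped = read_words(path)
```

If `round_tripped == words` holds, the file contains exactly the tokens that were produced, and any remaining display problem is down to the terminal or editor rather than the data.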

Switch to Python 3 and things get simpler. It defaults to displaying the characters if the terminal supports them, but you can still get the debugging output as well:

    words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
    print(words)
    print(ascii(words))

Output:

    ['ထို', 'င္းေ', 'ရာ']
    ['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
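
This also suggests why the `.decode('unicode-escape')` attempt in the question fails. The escapes exist only in the printed representation; the strings already hold real characters. In Python 2, calling `.decode()` on a Unicode string first encodes it implicitly with the ASCII codec, which is a likely source of the `UnicodeEncodeError`. The `unicode_escape` codec is only needed when the data really contains backslash text. A short Python 3 sketch of the difference (variable names are illustrative):

```python
# Python 3 sketch: escapes in the display versus escapes in the data.
word = u'\u1011\u102d\u102f'

# The string holds three code points, not eighteen ASCII characters.
length = len(word)

# ascii() builds the escaped, debugging-style representation on demand.
escaped_display = ascii(word)

# Only here do the backslashes become actual data (a bytes object).
literal = word.encode('unicode_escape')

# Decoding with the same codec turns that backslash text back into characters.
restored = literal.decode('unicode_escape')
```

Here `restored` equals `word`, so the tokens returned by `text.words` need no conversion at all, only a different way of printing them.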
Mark Tolonen

It looks like your terminal cannot display UTF-8 encoded Unicode. Try saving the output to a file instead, encoding each token as UTF-8:

    # -*- coding: utf-8 -*-

    from __future__ import unicode_literals
    from polyglot.text import Text

    blob = u"""
    ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
    """
    text = Text(blob)


    with open('output.txt', 'a') as the_file:
        for word in text.words:
            the_file.write("\n")
            the_file.write(word.encode("utf-8"))
Suhail Gupta