Converting Unicoded text to readable text in Python

Question

I have Unicode text as follows:

(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))

How do I change this to a readable format by converting the '\u0___' codes into the corresponding readable characters? I'm using Python 2.7.

I obtained that output with the following code segment in NLTK (3.0), where tree is an nltk.tree.Tree:

for tree in treelist1:
    print unicode(str(tree))

I need something like print(TreePrettyPrinter(tree).text()), which gives the Unicode-compatible output I want, but in a tree layout that I don't want. Is there a method in NLTK that gives readable, text-like output as well?

I have the same issue with the output from

for rule in grammar1.productions():
    print(rule.unicode_repr())

where grammar1 is an nltk.grammar.CFG.

The output is as follows:

VP -> V
VP -> NP V
N -> '\u0db6\u0dbd\u0dca\u0dbd\u0dcf'
N -> '\u0db8\u0dd2\u0db1\u0dd2\u0dc3\u0dcf'
N -> '\u0db8\u0dda\u0dc3\u0dba'

The final results are perfectly fine. I only have issues with the representation of the output.

Did you try printing the value contained in the field itself? — Ignacio Vazquez-Abrams, Sep 28 '15 at 20:05
The Windows console is notoriously bad at handling Unicode strings, you may be better creating some sort of interface or file you can output to, rather than lots of explicit encoding/decoding — Bob Dylan, Sep 28 '15 at 20:07
@IgnacioVazquez-Abrams It gives the same output for the field itself, e.g. print(tree) and print(grammar1) — Upekha Vandebona, Sep 28 '15 at 20:19
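
The suggestion above about writing to a file rather than the console can be sketched roughly as follows. This is only an illustration, assuming the text is already a real Unicode string; the file name productions.txt and the temporary directory are made up for the example, and io.open is used so the same code runs on Python 2.7 and 3.

```python
import io
import os
import tempfile

# One production from the question, held as a real Unicode string.
text = u"N -> '\u0db6\u0dbd\u0dca\u0dbd\u0dcf'"

# Hypothetical output location, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "productions.txt")

# Writing with an explicit encoding sidesteps the console's code page.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text + u"\n")

# Reading it back with the same encoding returns the original characters.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
```

Opening the resulting file in a UTF-8 aware editor should show the Sinhala characters, independent of what the console can render.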

score 3 · Accepted Answer · edited May 23 '17 at 11:45

The solutions in this question also work for Python 2.7.

This has nothing to do with NLTK. A simple solution is to decode the output text with 'unicode_escape':

print(str(tree).decode('unicode_escape'))

and

print(rule.unicode_repr().decode('unicode_escape'))
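
As a rough, NLTK-free illustration of what the two calls above do, the sketch below applies the same codec to the bracketed string from the question. The helper name unescape is hypothetical, and the extra encode('ascii') step is an addition so that the snippet also runs on Python 3, where str has no decode method.

```python
def unescape(text):
    # encode('ascii') raises an error if the text already contains
    # non-ASCII characters, rather than silently corrupting them.
    return text.encode('ascii').decode('unicode_escape')


# The escaped output from the question, kept as literal backslash-u sequences.
escaped = r"(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))"

readable = unescape(escaped)

# Each six-character escape collapses into a single character,
# so the decoded string is shorter than the escaped one.
print(len(escaped), len(readable))
```

Calling print(readable) then shows the Sinhala text, provided the console can display it.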

For an NLTK-style solution that prints a tree of type nltk.tree.Tree as bracketed text, use the following:

print(tree.pformat())

answered Sep 28 '15 at 21:16

Upekha Vandebona


a simpler solution is to avoid invoking `unicode_repr` or `str`. `'unicode_escape'` masks the bugs upstream. – jfs Sep 29 '15 at 03:09
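
One possible reading of this comment is that a blanket 'unicode_escape' pass is fragile. The sketch below, which uses a made-up leaf value and the same hypothetical unescape helper, shows two ways it can go wrong: backslashes that were never escapes get reinterpreted, and text that was already readable turns into mojibake.

```python
def unescape(text):
    # Same decoding step as in the answer, with an explicit ASCII check.
    return text.encode('ascii').decode('unicode_escape')


# Pitfall 1: a leaf that legitimately contains backslashes (made-up example).
leaf = r"C:\new\table"
mangled = unescape(leaf)
# The codec treats \n and \t as newline and tab, so the path is damaged.
path_was_damaged = mangled != leaf

# Pitfall 2: text that is already a readable Unicode string.
already_readable = u"\u0db6"
# Its UTF-8 bytes are reinterpreted one byte at a time as Latin-1.
corrupted = already_readable.encode('utf-8').decode('unicode_escape')
text_was_corrupted = corrupted != already_readable
```

Keeping the data as Unicode strings from the start and never passing it through str or a repr-style function avoids both cases.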
