I've read the Python 2.7 "Unicode HOWTO" more than once and searched this forum thoroughly, but nothing I found and tried makes my program work.

It's supposed to convert dictionary.com entries into sets of example sentences and word-pronunciation pairs. Yet it fails at the very start: the IPA (i.e. Unicode) characters turn into gibberish right after they're entered.

# -*- coding: utf-8 -*-

""" HERE'S WHAT A TYPICAL DICTIONARY.COM ENTRY LOOKS LIKE
white·wash
/ˈʰwaɪtˌwɒʃ, -ˌwɔʃ, ˈwaɪt-/ Show Spelled
noun
1.
a composition, as of lime and water or of whiting, size, and water, used for whitening walls, woodwork, etc.
2.
anything, as deceptive words or actions, used to cover up or gloss over faults, errors, or wrongdoings, or absolve a wrongdoer from blame.
3.
Sports Informal. a defeat in which the loser fails to score.
verb (used with object)
4.
to whiten with whitewash.
5.
to cover up or gloss over the faults or errors of; absolve from blame.
6.
Sports Informal. to defeat by keeping the opponent from scoring: The home team whitewashed the visitors eight to nothing.
"""

def wdefinp():   #word definition input
    wdef=u''
    emptylines=0 
    print '\nREADY\n\n'
    while True:
        cinp=raw_input()   #current input line
        if cinp=='':
            emptylines += 1
            if emptylines >= 3:   #breaking out by 3xEnter
                wdef=wdef[:-2]
                return wdef
        else:
            emptylines = 0
        wdef=wdef + '\n' + cinp
    return wdef

wdef=wdefinp()
print wdef.decode('utf-8')

This yields: white·wash /�ʰwaɪtˌwɒ�, -ˌwɔ�, �waɪt-/ Show Spelled ...

Any help will be appreciated.

1 Answer

OK, I managed to reproduce a couple of faults with your program.

First, when I ran it in a terminal and pasted in the example text, I got an error at this line (sorry, my line numbering doesn't match yours):

  File "unicod.py", line 22, in wdefinp
    wdef=wdef + '\n' + cinp
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 5: ordinal not in range(128)

To fix this, I used the answer from this Stack Overflow question: "How to read Unicode input and compare Unicode strings in Python?"

The fixed line is (it needs "import sys" at the top of the script):

cinp = raw_input().decode(sys.stdin.encoding)

Basically, you need to know the encoding of the input; once you do, you can decode the raw bytes into a unicode object.
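A minimal sketch of that decoding step, assuming the terminal delivers UTF-8 bytes. The helper name decode_line and its UTF-8 fallback are illustrative assumptions, not part of the original program; the fallback is there because sys.stdin.encoding can be None in Python 2 when input is piped rather than typed.

```python
def decode_line(raw, encoding=None):
    """Decode one raw input line (bytes) into text.

    Hypothetical helper: `encoding` would normally be sys.stdin.encoding;
    UTF-8 is an assumed fallback for when that value is None.
    """
    return raw.decode(encoding or 'utf-8')


# The headword line of the entry, as bytes from a UTF-8 terminal.
raw_line = b'white\xc2\xb7wash'

# Same failure as in the traceback: ASCII cannot decode byte 0xc2 at position 5.
try:
    raw_line.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# With the right encoding, the two bytes 0xc2 0xb7 become one character.
line = decode_line(raw_line, 'utf-8')
```

In the program itself, the call would look something like cinp = decode_line(raw_input(), sys.stdin.encoding).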

Once that is fixed, the next issue is a similar problem:

  File "unicod.py", line 28, in <module>
    print wdef.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 6: ordinal not in range(128)

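The data coming back from the function is now already a decoded unicode object, so decoding it a second time cannot work: Python 2 first tries to encode it back to bytes with the ASCII codec, which is why the error is a UnicodeEncodeError. Simply remove the ".decode('utf-8')" and it works fine.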
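The hidden step behind that second traceback can be sketched as follows. This is an illustration under assumptions rather than the original program: the variable names are made up, and the explicit encode('ascii') call stands in for what Python 2 does implicitly when .decode() is called on a unicode object.

```python
# Text that has already been decoded (a unicode object in Python 2).
text = u'white\xb7wash'

# Stand-in for the implicit step: encoding to ASCII fails on u'\xb7'.
try:
    text.encode('ascii')
    implicit_encode_failed = False
except UnicodeEncodeError:
    implicit_encode_failed = True

# Keep text decoded inside the program and encode only once, at output.
encoded = text.encode('utf-8')
```

The design choice here is the usual one: decode bytes as soon as they enter the program, work with unicode everywhere in between, and encode only when writing out.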

Vorsprung