I'm trying to read in two text files, one of which is encoded in UTF-8. I'm using Python 3, in PyCharm.

Examples from the 2 files:

1.
its group are in Spain .
its group are in Antarctica .
2.
sus grupos estan en España .
sus grupos estan en Antártida .

From the command line, I use:

paste -d "\n" hw5-tiny.en tiny.es | python3 ibm.py

to read the files into sys.stdin.

In my code, I use the following to read the pasted files:

#!/usr/bin/env python
#coding=utf8
import itertools
import sys

for fgn_sent,eng_sent in itertools.zip_longest(*[sys.stdin]*2):
   print(fgn_sent)

I then get the error:

Traceback (most recent call last):
  File "ibm0.py", line 33, in <module>
    initialize_probabilities()
  File "ibm0.py", line 13, in initialize_probabilities
    for fgn_sent,eng_sent in itertools.zip_longest(*[sys.stdin]*2):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 863: ordinal not in range(128)

Line 13 is the for loop line shown above.
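
The byte 0xc3 in the message is the lead byte of the two-byte UTF-8 encoding of characters such as ñ and á, which the ASCII codec cannot represent. A minimal sketch that reproduces the failure outside the script, using a sample line from the Spanish file (the variable names are illustrative only):

```python
# Sample line from the Spanish file, encoded as UTF-8 bytes.
raw = "sus grupos estan en España .".encode("utf-8")

# In UTF-8, 'ñ' is the two bytes 0xc3 0xb1.
assert b"\xc3\xb1" in raw

try:
    # Roughly what happens when stdin is decoded with the ASCII codec.
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the same bytes with the matching codec succeeds.
decoded = raw.decode("utf-8")

print(ascii_error)
print(decoded)
```

So the traceback suggests that sys.stdin was opened with ASCII as its encoding, not that the input file is malformed.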

Adam_G
The encoding comment has zero impact on whether or not your program can handle UTF-8. Please post your relevant actual code, as well as the **full text** of the traceback. – MattDMo Dec 06 '14 at 20:33

1 Answer

This post answered my question: How to set sys.stdout encoding in Python 3?

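I added PYTHONIOENCODING=utf-8:surrogateescape to my command line. This environment variable overrides the encoding (and, after the colon, the error handler) that Python uses for stdin, stdout and stderr.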

paste -d "\n" tiny.en tiny.es | PYTHONIOENCODING=utf-8:surrogateescape python3 ibm0.py
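
An alternative that does not depend on the environment is to re-wrap standard input with an explicit encoding inside the script. This is only a sketch, not the approach from the linked post: the helper names utf8_lines and sentence_pairs are hypothetical, and it assumes the piped data is valid UTF-8.

```python
import io
import itertools
import sys


def utf8_lines(binary_stream):
    """Hypothetical helper: wrap a binary stream so it yields UTF-8 text lines."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


def sentence_pairs(lines):
    """Group consecutive lines two at a time, like zip_longest(*[it]*2)."""
    it = iter(lines)
    return itertools.zip_longest(it, it)


# Intended use in the script (assumption: input arrives on stdin via paste):
#     for first, second in sentence_pairs(utf8_lines(sys.stdin.buffer)):
#         ...
```

Note that this only changes how input is decoded. Printing non-ASCII text can still fail if sys.stdout is using ASCII, which is one reason setting PYTHONIOENCODING is convenient: it covers both directions. On Python 3.7 and later, sys.stdin.reconfigure(encoding="utf-8") achieves the same effect as the wrapper above.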
Adam_G