I know there are already hundreds of Python Unicode questions on Stack Overflow. I've read lots of them, but I can't find an answer to mine...

I'm trying to read a latin-1 CSV file. It includes a UK pound sign (character \xa3 in latin-1), so I set encoding="latin-1" -- but Python appears to ignore the encoding. This:

import csv

with open(filename, newline='', encoding="latin-1") as csvfile:
    data = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in data:
        print(row)

Produces:

UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 202: ordinal not in range(128)

I've cut down the original CSV file to a single line that triggers the problem. It's the £ sign that causes it.

The only solutions I've found are to use errors="ignore" -- which is just hiding the problem, or errors="surrogateescape" -- which is just creating a problem with escaped characters further down the line.

I know that the file encoding is latin-1, although I have also tried utf-8 and iso-8859-1.

Python can happily print a £ sign:

>>> print('£')
£
>>> print(u'\xa3')
£

Any answers/advice/suggestions would be welcome. Thanks in advance.

=== UPDATE ===

This doesn't produce the error:

with open(filename, newline='', encoding="latin-1") as csvfile:
    data = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in data:
        print("do nothing with the data")
James
  • Can you provide the stack trace with your error? – Alastair McCormack Nov 07 '18 at 10:41
  • I suspect it's not the read that's the problem, but when you print the row to the screen. – Alastair McCormack Nov 07 '18 at 10:43
  • I think you're right... see update above. I thought I'd tried that earlier with the same result, but doing it now it doesn't cause the problem. So maybe I have a completely different question to ask later! Is there a quick answer you can share here, please? I can repost later if it needs a new question. Thanks. – James Nov 07 '18 at 10:44
  • And yes, the error is in print(row) in the stack trace. Thanks. – James Nov 07 '18 at 10:48
  • Strangely, your interactive console works OK. How are you invoking your script containing the CSV read? – Alastair McCormack Nov 07 '18 at 10:51
  • Sublime Text on a Mac. The interactive console is a REPL extension for the editor. Intriguingly, locale.getpreferredencoding() returns US ASCII, which seems odd on a Mac in the UK/Europe. When I get back to my computer I’ll try it from the command line, outside Sublime Text. – James Nov 07 '18 at 11:12

2 Answers

I was able to reproduce the problem by setting the locale to C, meaning that the character set is limited to ASCII:

$ LC_CTYPE=C python3 foo.py
Traceback (most recent call last):
  File "foo.py", line 7, in <module>
    print(row)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 7: ordinal not in range(128)

Line 7 is the line of the print call, so this problem appears on output, not on input.
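
To see the output side fail on its own, without a CSV file or a particular locale, here is a minimal sketch that prints a row to an in-memory text stream with a chosen encoding. The helper name write_row is made up for this illustration, and the in-memory stream only stands in for sys.stdout; it is not what the interpreter does internally.

```python
import io


def write_row(row, encoding):
    """Print a row to an in-memory stream that encodes text with `encoding`.

    Hypothetical helper: the stream plays the role of sys.stdout so the
    effect of the output encoding can be seen in isolation.
    """
    buffer = io.BytesIO()
    # newline='\n' keeps the bytes identical on every platform.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline='\n')
    print(row, file=stream)   # same call as in the question
    stream.flush()
    data = buffer.getvalue()
    stream.detach()           # leave the underlying buffer open
    return data


if __name__ == "__main__":
    row = ['1', '\xa3']       # '\xa3' is the pound sign
    print(write_row(row, 'utf-8'))
    try:
        write_row(row, 'ascii')
    except UnicodeEncodeError as exc:
        print("output failed:", exc)
```

With 'ascii' the failure happens inside print, which matches the traceback above: the row was decoded correctly on input, and only encoding it for output fails.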

With a UTF-8 locale, it works:

$ LC_CTYPE=en_GB.UTF-8 python3 foo.py
['1', '£']
['2', 'a']

You can check the default locale with the locale command:

$ locale
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
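
The same check can be made from inside Python, which is useful when the script is launched by an editor rather than from a shell. This is only a sketch: the function name describe_text_encodings is invented here, and the values it reports depend entirely on how the interpreter was started.

```python
import locale
import sys


def describe_text_encodings():
    """Report the encodings Python picked up from its environment.

    Hypothetical helper for diagnosis only.
    """
    return {
        # Encoding derived from the locale settings (no setlocale call).
        "preferred": locale.getpreferredencoding(False),
        # Encoding that print() uses when writing to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_text_encodings().items():
        print(f"{name}: {value}")
```

If "stdout" comes back as an ASCII variant, printing any non-ASCII character such as £ will fail in the way shown above.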
legoscia

The answer is a long way from anything I expected when I posted the question. It's nothing to do with Python. It's the editor.

I'm running the code from the Sublime Text 3 editor on a Mac. It turns out that when you do that, the interpreter doesn't get any locale information unless you explicitly pass it on.
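
As a rough illustration of what "passing it on" can look like, the sketch below starts a child interpreter with the PYTHONIOENCODING environment variable set, which tells Python which encoding to use for stdin/stdout/stderr regardless of the locale. Using this variable is my assumption about one workable approach, not something taken from the editor's documentation, and the helper name run_with_io_encoding is made up.

```python
import os
import subprocess
import sys


def run_with_io_encoding(code, encoding="utf-8"):
    """Run `code` in a child interpreter with an explicit I/O encoding.

    Hypothetical helper: it copies the current environment and adds
    PYTHONIOENCODING, much as an editor's build settings might.
    """
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = encoding
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout  # raw bytes written by the child


if __name__ == "__main__":
    # The child prints a pound sign; the parent sees its encoded bytes.
    print(run_with_io_encoding("print('\\xa3')"))
```

In an editor, the equivalent step would be adding that variable to the environment the build command runs in; where exactly that is configured depends on the editor.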

I've now discovered that my question is a duplicate of this one:

Printing UTF8 in Python 3 using Sublime Text

The comments and answers above helped me find that other question and get to an answer. So:

If Alastair McCormack or legoscia would like to post the above as an answer then I'll be happy to accept it to thank you for your help. If you don't then I'll accept my own answer so that other people see it.

Or if someone reading this wants to mark my question as a duplicate of the one I've linked above then please go ahead. Thank you all.

James