
I am currently running: Python 3.5.1 :: Anaconda 4.0.0 (x86_64).

ERROR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7601: ordinal not in range(128)

When I run the code below I get the error above. When I save the txt file and open it from a local directory I get the same error; however, when I save a duplicate shortened to ~25 lines, the code runs as expected. Any guidance would be very much appreciated.

import numpy as np
import matplotlib.pyplot as pp
import seaborn
import urllib.request


urllib.request.urlretrieve('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt','stations.txt')

print(open('stations.txt','r').readlines()[:10])
Martijn Pieters
JMH
  • Did you check what encoding was used for the file? I'm sure the NOAA specifies that somewhere. Then use that encoding when opening the file. – Martijn Pieters Jul 01 '16 at 18:24
  • Also, if you only need the first 10 lines, don't read the whole file first; it's a large file. `from itertools import islice`, then `lines = list(islice(openfileobj, 10))` would give you the first 10 lines of an open file object without reading the rest. – Martijn Pieters Jul 01 '16 at 18:30
  • I was only printing the first 10 lines to check that it was working properly, since it is a large file. – JMH Jul 01 '16 at 18:48
  • I'm curious why the code runs if I delete most of the document (i.e. if I cut the txt file down to 50 lines and then print the first 10, it works, but not with the full file). Is it possible for certain lines in a txt file to be encoded differently from others? – JMH Jul 01 '16 at 18:50
  • Everything works in IPython; however, it does not behave the same way when I run it in Sublime Text. – JMH Jul 01 '16 at 18:51
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – Athena Jul 01 '16 at 18:56
  • @jmh1092: different locale; you are not specifying an encoding, so `locale.getpreferredencoding()` is used to determine what codec to use. Never rely on the default. – Martijn Pieters Jul 01 '16 at 18:58
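
The `islice` suggestion in the comments above could look roughly like the sketch below. The helper name `first_lines` and the small sample file are hypothetical stand-ins for `stations.txt`, and UTF-8 is assumed for the sample only.

```python
import os
import tempfile
from itertools import islice


def first_lines(path, n, encoding):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, 'r', encoding=encoding) as f:
        return list(islice(f, n))


# Hypothetical sample file standing in for stations.txt.
with tempfile.NamedTemporaryFile('w', encoding='utf8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(''.join('line {}\n'.format(i) for i in range(100)))
    sample_path = tmp.name

lines = first_lines(sample_path, 10, 'utf8')
print(lines)

os.remove(sample_path)  # clean up the sample file
```

Unlike `readlines()[:10]`, this stops iterating once ten lines have been collected, so the remainder of a large file is never turned into a list.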

1 Answer


Unfortunately, the documentation for that directory does not specify what codec is used for the files, so I opened the file in binary mode instead and found the bytes that caused 'offense'.
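
One way such a search might be done is sketched below; it is not necessarily the exact procedure used for this answer. The function name `find_non_ascii_lines` and the two-line sample file are hypothetical.

```python
import os
import tempfile


def find_non_ascii_lines(path, limit=5):
    """Return (line_number, raw_bytes) pairs for lines that are not pure ASCII."""
    hits = []
    with open(path, 'rb') as f:  # binary mode: no decoding takes place
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode('ascii')
            except UnicodeDecodeError:
                hits.append((number, raw))
                if len(hits) >= limit:
                    break
    return hits


# Hypothetical two-line sample; the second line contains UTF-8 bytes.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'PLAIN ASCII LINE\n')
    tmp.write(b'ESPA\xc3\xb1OLA\n')
    sample_path = tmp.name

hits = find_non_ascii_lines(sample_path)
print(hits)

os.remove(sample_path)  # clean up the sample file
```

Calling `find_non_ascii_lines('stations.txt')` on the downloaded file would report the first few lines that an ASCII codec cannot decode, together with their line numbers.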

The data is encoded as UTF-8; the 'offending' bytes you encounter spell out ESPAñOLA:

>>> line
b'US1NMRA0022  36.0456 -106.1517 1955.0 NM ESPA\xc3\xb1OLA 5.4 WNW                           \n'
>>> line.decode('utf8')
'US1NMRA0022  36.0456 -106.1517 1955.0 NM ESPAñOLA 5.4 WNW                           \n'

That's the 63815th line in the file, if you are curious, which is why you don't see this issue when you truncate the file.

Open the file with that codec:

open('stations.txt', 'r', encoding='utf8')

Don't rely on the default, which depends on your locale (which easily differs from environment to environment).
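
As a small illustration of the difference, the sketch below decodes the bytes quoted earlier with an ASCII codec and then with UTF-8, and prints the codec that `open()` would fall back on in the current environment. The variable names are illustrative only.

```python
import locale

raw = b'ESPA\xc3\xb1OLA'  # the bytes quoted in the answer

# What open() would use if no encoding were passed; varies by environment.
print('default codec:', locale.getpreferredencoding(False))

# An ASCII codec rejects byte 0xc3 ...
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print('ascii failed:', exc)

# ... while an explicit UTF-8 codec decodes the same bytes.
text = raw.decode('utf8')
print(ascii(text))  # ascii() keeps the output printable on any terminal
```

If the first line prints an ASCII-based codec in one environment and `UTF-8` in another, that would account for the same script behaving differently in the two.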

Martijn Pieters