
I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n-terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me?

starball
Vital
  • Can we see the file you are using? – dangee1705 Jan 02 '18 at 20:45
  • Also, is this Python 2 or 3? The answer is *very* important, since the `csv` module is broken for non-ASCII on Python 2. – ShadowRanger Jan 02 '18 at 20:49
  • I am using Python 3.6.0 – Vital Jan 02 '18 at 20:54
  • Hmm... On rereading the error, I'm pretty sure the problem is your input file. The error indicates it is trying to read it as `utf-8`, so your input likely doesn't follow the format described. That said, the file you linked seems to follow it just fine (it's pure ASCII AFAICT; it uses some unusual ASCII control characters, but they're all in the ASCII range), so I'm not sure where you'd see a `\xa0` byte. Is it possible you modified the file by accident before using it? – ShadowRanger Jan 02 '18 at 21:04
  • See koPytok's answer below: if I change the encoding to 'windows-1252' it works perfectly. – Vital Jan 02 '18 at 21:09
  • A side-note: You should be passing `newline=''` to `open` when working with CSV-like stuff. And the `excel_tab` dialect is wrong here; it assumes line endings are `\r\n`, when the file is `\n` endings. Defining your own dialect based off `excel_tab` would be an easy solution, just subclass it and set the class level variable `lineterminator = '\n'` – ShadowRanger Jan 02 '18 at 21:13
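The dialect tweak described in that last comment can be sketched as follows (the sample data is made up to stand in for `txt.tsv`; note that the csv *reader* already accepts both `\r\n` and `\n` line endings, so `lineterminator` mostly matters when writing):

```python
import csv
import io

# Subclass the excel-tab dialect and override the line terminator,
# since the SEC files use '\n' rather than excel_tab's default '\r\n'.
class SecTab(csv.excel_tab):
    lineterminator = '\n'

# Tiny made-up sample standing in for txt.tsv:
sample = io.StringIO("adsh\ttag\tvalue\n0001-01\tAssets\t100\n", newline='')
rows = list(csv.DictReader(sample, dialect=SecTab))
print(rows[0])
```

When reading from a real file, the `newline=''` recommendation from the comment applies to the `open()` call: `open('txt.tsv', newline='')`.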

6 Answers


The file's encoding is 'windows-1252'. Use:

open('txt.tsv', encoding='windows-1252')
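A self-contained sketch of this fix applied to the question's loop (the sample file and its contents are made up; the `\xa0` byte is the one that tripped the utf-8 decoder in the question):

```python
import csv

# Write a tiny sample containing a 0xa0 byte, which is invalid as a
# UTF-8 start byte but decodes fine under windows-1252.
with open('sample.tsv', 'wb') as f:
    f.write(b'adsh\tname\n0001\tAcme\xa0Corp\n')

with open('sample.tsv', encoding='windows-1252', newline='') as tsvfile:
    rows = list(csv.DictReader(tsvfile, dialect='excel-tab'))
print(rows[0])
```

Under windows-1252 the `0xa0` byte decodes to U+00A0, a no-break space, so the row parses cleanly.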
koPytok
  • Thank you very much!! That works! May I ask you why it works with 'windows-1252' although the SEC states it is 'utf-8'? – Vital Jan 02 '18 at 21:06
  • Are you sure it's cp1252? The file I downloaded appeared to be ASCII. If it's not UTF-8, and not ASCII, it could be literally any single-byte-per-character ASCII superset and you'd only be able to guess at the encoding heuristically (it would successfully decode under any of them, but the results might be garbage). – ShadowRanger Jan 02 '18 at 21:11
  • @Vital Better ask the SEC – koPytok Jan 02 '18 at 21:14
  • @ShadowRanger An encoding detector detected cp1252, and the result seems to be legit – koPytok Jan 02 '18 at 21:15
  • This has the potential of producing invalid results. CP-1252 will happily decode *anything* (audio data, core dumps, zip archives) and pretend it's all valid text. – tripleee Jan 03 '18 at 07:07
  • Casual inspection of my download of `txt.tsv` indicates no 0xa0 character at the offset indicated in the question, but plenty of 0xa0 characters which are apparently representing hard spaces, and 0xac characters in a position which indicates a currency indicator, as well as 0xae, which apparently is the ® symbol. This is *almost* consistent with CP1252 or ISO-8859-1 (which of course are very similar), but the 0xac doesn't fit with either. Maybe see also https://cdn.rawgit.com/tripleee/8bit/master/encodings.html#ac *(cough.)* – tripleee Jan 03 '18 at 07:07
  • In my case, I had a text file with Windows CRLF instead of Unix LF. – TBirkulosis Nov 29 '22 at 22:03
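The kind of byte-level inspection described in the comments above can be sketched like this (the sample bytes are made up; with the real file you would read `open('txt.tsv', 'rb').read()` instead):

```python
# Scan raw bytes for anything outside the ASCII range and report
# each offending byte's offset and value.
data = b'total\xa0assets\t\xae 2017\n'  # made-up stand-in for file bytes
hits = [(i, b) for i, b in enumerate(data) if b > 0x7f]
for offset, value in hits:
    print(f'offset {offset}: 0x{value:02x}')
```

The offsets tell you where to look, and the byte values are what you then try to interpret under candidate encodings.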

If you are working with Turkish data, I suggest this line:

import pandas as pd

df = pd.read_csv("text.txt", encoding="windows-1254")
Unheilig
Hasim D
import pandas as pd

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')

Works fine for me, thanks.

Andrew
raj kumar

If the input has a stray '\xa0', then it's not in UTF-8, full stop.

Yes, you have to either recode it to UTF-8 (see: iconv, recode commands, or a lot of text editors and IDEs can do it), or read it using an 8-bit encoding (as all the other answers suggest).
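A pure-Python equivalent of the iconv/recode step looks like this (the source encoding 'windows-1252' and the file contents are assumptions here; substitute whatever your data actually uses):

```python
# Create a stand-in for the real file, containing a 0xa0 byte:
with open('sample.tsv', 'wb') as f:
    f.write(b'total\xa0assets\n')

# Decode from the 8-bit encoding and re-encode as UTF-8:
with open('sample.tsv', encoding='windows-1252') as src, \
     open('sample_utf8.tsv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(src.read())

# The rewritten file now decodes cleanly as UTF-8:
with open('sample_utf8.tsv', encoding='utf-8') as f:
    text = f.read()
```

After the round-trip, the single `0xa0` byte becomes the two-byte UTF-8 sequence `0xc2 0xa0` for U+00A0.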

What you should ask yourself is: what is this character after all (0xa0, decimal 160)? Well, in many 8-bit encodings it's a non-breaking space (like `&nbsp;` in HTML). In at least one DOS encoding it's an accented "a" character. That's why you need to look at the result of decoding it from the 8-bit encoding.
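A quick way to check what byte 0xa0 means under a few common 8-bit encodings (illustrative; the right choice depends on where the file came from):

```python
import unicodedata

for enc in ('windows-1252', 'latin-1', 'cp437'):
    ch = b'\xa0'.decode(enc)
    print(enc, repr(ch), unicodedata.name(ch))
```

Under both windows-1252 and latin-1 it is the no-break space, while under the DOS encoding cp437 it is "á", which matches the point above.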

BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren't far off:

In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'

One extra preceding '\xc2' byte would do the trick.

Tomasz Gandor

I also encountered this issue, and it worked when I used latin1 encoding; see the sample code below to adapt to your codebase. Give it a try if the resolutions above don't work.

import pandas as pd

# 'missing' is whatever list of sentinel strings your data uses for NA values
missing = ['n/a', 'na', '--']
df = pd.read_csv("../CSV_FILE.csv", na_values=missing, encoding='latin1')
Suresh Gautam

I had the same error message for a .csv file, and this worked for me:

import pandas as pd

df = pd.read_csv('Text.csv', encoding='ANSI')
Ghulam Dastgeer