csv read raises "UnicodeDecodeError: 'charmap' codec can't decode..."

Question

I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to <undefined>

When I run the code:

import csv

input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11,13, 14, 19, 20, 21, 22, 23, 24]

cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed

with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)

What am I doing wrong?

This is a decoding error; you can find help [here](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character). – Shijith, Nov 28 '19 at 06:34
You should check the byte at position 70 and find an encoding that maps it to a character, then open the file with that encoding. – Shahir Ansari, Nov 28 '19 at 06:49

score 6 · Accepted Answer · answered Nov 28 '19 at 06:47

In Python 3, the csv module works on Unicode strings, so the input file has to be decoded first. You can pass the exact encoding if you know it, or use Latin1: it maps every byte to the Unicode character with the same code point, so decoding and then re-encoding leaves the byte values unchanged. Your code could become:

...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
        ...
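To illustrate the round-trip claim, here is a minimal sketch (not part of the original answer) that decodes every possible byte value with Latin-1 and encodes it back. The variable names are made up for the illustration.

```python
# Every byte value 0..255, including 0x8d from the error message.
raw = bytes(range(256))

# Latin-1 decodes each byte to the code point with the same number...
text = raw.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# ...so encoding back reproduces the original bytes exactly.
round_tripped = text.encode("latin-1")
assert round_tripped == raw

# The same byte sequence is not valid UTF-8, so a strict UTF-8 decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_tripped == raw, utf8_ok)
```

One caveat: if the file is really UTF-8, non-ASCII characters will look wrong while they are held as Latin-1 strings. That should not matter for a script that only drops columns and writes the rows back with the same encoding, as long as the delimiter and quote characters are plain ASCII.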

Serge Ballesta

This seems to at least make it past the error so far. There are 7 million lines still left to process, so it will be interesting to see the output. I did confirm the encoding was UTF-8, but adding an `encoding="utf8"` as others suggested resulted in a different decoding error. – Momboosa Nov 28 '19 at 07:07
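If a file is believed to be UTF-8 but still fails to decode, it may help to find which lines contain the offending bytes. The following is only a sketch: `find_undecodable_lines` is a hypothetical helper, not something from this thread or from the standard library, and the demo file is generated on the fly.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return up to `limit` tuples (line_number, byte_offset_in_line, raw_line)
    for lines that cannot be decoded with `encoding`."""
    problems = []
    with open(path, "rb") as handle:  # binary mode: nothing is decoded here
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as exc:
                problems.append((line_number, exc.start, raw_line))
                if len(problems) >= limit:
                    break
    return problems


# Demo on a throwaway file: line 3 contains the stray byte 0x8d.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"a,b\n" + "caf\u00e9,x\n".encode("utf-8") + b"bad\x8d,y\n")
    sample_path = tmp.name

problems = find_undecodable_lines(sample_path)
os.remove(sample_path)

for line_number, offset, raw_line in problems:
    print(line_number, offset, raw_line)
```

Looking at the raw bytes of the reported lines usually makes it clear whether the file is in another encoding entirely or is mostly UTF-8 with a few damaged records.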

score 4 · Answer 2 · answered Nov 28 '19 at 06:41

Add encoding="utf8" when opening the file. Try this instead:

with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:
        ...

shubhambharti201

UTF-8 is usually the default encoding in Python 3. If it broke with the default encoding it is likely to break with UTF-8... – Serge Ballesta Nov 28 '19 at 06:44
@SergeBallesta Any idea why it's trying to decode it with `charmap` by default? I don't see that an encoding was specified when opening the file, so I expected it to be UTF-8 as well. – emremrah Nov 28 '19 at 06:50
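On the `charmap` question: when `encoding` is omitted, `open()` in text mode falls back to the locale's preferred encoding (unless Python's UTF-8 mode is enabled), and on many Windows setups that is a code page such as cp1252. Python implements those code pages with its charmap codec, which is where the name in the error comes from. A small sketch, assuming cp1252 was the default on the asker's machine (the thread does not confirm this):

```python
import locale

# Encoding that open() uses in text mode when `encoding` is not given
# (platform dependent; often a code page on Windows, UTF-8 elsewhere).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Python's cp1252 codec has no character assigned to byte 0x8d,
# so decoding it fails with the same kind of message as in the question.
try:
    b"\x8d".decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

Passing `encoding=` explicitly to `open()` removes the dependence on the machine's locale.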

score 0 · Answer 3 · answered Nov 28 '19 at 06:35

Try pandas:

import pandas

input_file = pandas.read_csv('input.csv')
output_file = pandas.read_csv('output.csv')

Alternatively, try saving the file again as CSV UTF-8.

Hamza Zubair
