4

I've read every post I can find, but my situation seems unique. I'm totally new to Python so this could be basic. I'm getting the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 70: character maps to <undefined>

When I run the code:

import csv

input_file = 'input.csv'
output_file = 'output.csv'
cols_to_remove = [4, 6, 8, 9, 10, 11,13, 14, 19, 20, 21, 22, 23, 24]

cols_to_remove = sorted(cols_to_remove, reverse=True)
row_count = 0 # Current amount of rows processed

with open(input_file, "r") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='') as result:
        writer = csv.writer(result)
        for row in reader:
            row_count += 1
            print('\r{0}'.format(row_count), end='')
            for col_index in cols_to_remove:
                del row[col_index]
            writer.writerow(row)

What am I doing wrong?

emremrah
Momboosa

3 Answers

6

In Python 3, the csv module works with Unicode strings, so the input file has to be decoded first. You can use the exact encoding if you know it, or just use Latin1, which maps every byte to the Unicode character with the same code point, so decoding and then re-encoding leaves the byte values unchanged. Your code could become:

...
with open(input_file, "r", encoding='Latin1') as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding='Latin1') as result:
        ...
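To illustrate the round-trip claim, here is a minimal sketch. It assumes the default encoding on the asker's machine was cp1252, which the question does not confirm, although the 'charmap' wording in the error is what that codec produces. The variable names are only for illustration.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 decodes each byte to the code point with the same numeric value...
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# ...so encoding the text again gives back exactly the original bytes.
round_tripped = text.encode("latin-1")

# Assumption: the failing default codec was cp1252. It has no character
# assigned to byte 0x8d, so decoding that byte raises UnicodeDecodeError.
try:
    b"\x8d".decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)

print(round_tripped == all_bytes)
print(cp1252_error)
```

Note that the bytes survive unchanged, but any text that was really UTF-8 will look garbled if you inspect the decoded strings in between.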
Serge Ballesta
  • This seems to at least make it past the error so far. There are 7 million lines still left to process, so it will be interesting to see the output. I did confirm the encoding was UTF-8, but adding an `encoding="utf8"` as others suggested resulted in a different decoding error. – Momboosa Nov 28 '19 at 07:07
4

Add encoding="utf8" when opening the files. Try this instead:

with open(input_file, "r", encoding="utf8") as source:
    reader = csv.reader(source)
    with open(output_file, "w", newline='', encoding="utf8") as result:
        ...
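If the file is mostly UTF-8 but contains a few stray bytes, one possible variation is to pass `errors="replace"` to `open()`. The sketch below is not from the original answers: `remove_columns` is a hypothetical helper name, and the tiny sample file is made up for the demonstration. Undecodable bytes become U+FFFD, so this is lossy.

```python
import csv
import os
import tempfile


def remove_columns(src_path, dst_path, cols_to_remove, encoding="utf-8"):
    """Copy a CSV, dropping the given column indexes (hypothetical helper)."""
    # Delete from the highest index down so earlier deletions
    # do not shift the positions of later ones.
    cols = sorted(cols_to_remove, reverse=True)
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(src_path, "r", newline="", encoding=encoding,
              errors="replace") as source, \
         open(dst_path, "w", newline="", encoding=encoding) as result:
        writer = csv.writer(result)
        for row in csv.reader(source):
            for index in cols:
                if index < len(row):  # skip indexes a short row does not have
                    del row[index]
            writer.writerow(row)


# Made-up sample: the lone 0x8d byte is not valid UTF-8.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "input.csv")
dst = os.path.join(tmp_dir, "output.csv")
with open(src, "wb") as f:
    f.write(b"a,b\x8d,c\n1,2,3\n")

remove_columns(src, dst, [1])
with open(dst, "r", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

If the damaged bytes matter, decoding with the file's real encoding (or Latin1, as in the other answer) is the safer route.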
  • UTF-8 is usually the default encoding in Python 3. If it broke with the default encoding it is likely to break with UTF-8... – Serge Ballesta Nov 28 '19 at 06:44
  • @SergeBallesta Any idea why it's trying to decode it with `charmap` by default? I don't see he/she specified encoding when opening the file so I expected it to be UTF-8 as well. – emremrah Nov 28 '19 at 06:50
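On the question in the comment above: as far as I know, when no `encoding` argument is given, `open()` uses the locale's preferred encoding rather than always UTF-8, unless UTF-8 mode is enabled. On many Windows setups that is cp1252, which would explain the 'charmap' message, though the asker's actual setting is an assumption. A quick way to check on your own machine:

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
default_text_encoding = locale.getpreferredencoding(False)
print("open() default:", default_text_encoding)

# 1 when UTF-8 mode is on (python -X utf8 or PYTHONUTF8=1), otherwise 0.
print("UTF-8 mode:", sys.flags.utf8_mode)
```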
0
  1. Try pandas:

import pandas

data = pandas.read_csv('input.csv')
data.to_csv('output.csv', index=False)

  2. Try saving the file again as CSV UTF-8.
Hamza Zubair