I have a comma-separated .txt file with French characters such as Vétérinaire and Désinfectant.

import pandas as pd
df = pd.read_csv('somefile.txt', sep=',', header=None, encoding='utf-8')

[Decode error - output not utf-8]

I have read many Q&A posts (including this one) and tried several other encodings such as 'latin1' and 'utf-16', but none of them worked. However, when I run the exact same script on a different Windows 10 computer with a similar Python setup (both Python 3.6), it works perfectly fine.

Edit: I tried this. Using encoding='cp1252' works for some of the .txt files I want to import, but for a few of them it raises the following error:

  File "C:\Program_Files_Extra\Anaconda3\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>

Edit: Trying to identify the encoding with chardet:

import chardet
import pandas as pd

test_txt = 'somefile.txt'

# Read the raw bytes and let chardet guess the encoding
with open(test_txt, 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
charenc = result['encoding']

print(charenc)

df = pd.read_csv(test_txt, sep=',', header=None, encoding=charenc)

print(df.head())

utf-8
[Decode error - output not utf-8]
KubiK888

1 Answer

The encoding your program uses to open the files doesn't match the encoding of their actual contents.
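To illustrate what such a mismatch looks like, here is a minimal sketch using one of the words from the question. Treating cp1252 as the encoding the bytes were written in is only an assumption for the demonstration; your actual files may use something else.

```python
# Hypothetical illustration: bytes written as cp1252 but read back as UTF-8.
text = "Vétérinaire"
raw = text.encode("cp1252")   # each 'é' becomes the single byte 0xE9

try:
    raw.decode("utf-8")       # 0xE9 is not valid here as UTF-8
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("utf-8 failed:", exc)

# Decoding with the encoding the bytes were written in recovers the text.
roundtrip = raw.decode("cp1252")
print(roundtrip)
```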

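Option 1: Decode the file contents to Python string objects by opening the file in text mode with an explicit encoding (binary mode 'rb' does not accept an encoding argument):

rawdata = open(test_txt, 'r', encoding='utf-8').read()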
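If you are not sure which encoding a given file uses, one possible approach is to try a few candidates in order and keep the first one that decodes the whole file without an error. The helper name `first_working_encoding` and the candidate list below are my own assumptions, not part of pandas or chardet.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper. Note that 'latin1' maps every byte to a character,
    so it never raises; keep it last, and check the result by eye because
    it can silently produce the wrong characters.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

The returned name can then be passed as the encoding argument of pd.read_csv. A clean decode only shows that the bytes are valid in that encoding, not that it is the encoding the file was actually written in.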

Option 2: Open the CSV file in an editor such as Sublime Text and save it with UTF-8 encoding, so that pandas can read it with encoding='utf-8'.
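With many files, the same re-saving step can also be done in Python rather than by hand. This is only a sketch: `resave_as_utf8` is a hypothetical helper, and it assumes you already know the source encoding of each file.

```python
def resave_as_utf8(src_path, dst_path, src_encoding):
    """Read a text file in src_encoding and write a UTF-8 copy.

    Hypothetical helper; src_encoding must be the file's real encoding.
    newline="" keeps the original line endings unchanged on both sides.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original file intact in case the assumed source encoding turns out to be wrong.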

Niharika Bitra