I'm working on a project using Python (3.6) in which I need to read text files from a directory that can contain thousands of text files, perform some analysis on them, and upload the results to Google Cloud Storage. Encoding errors occur when reading some of the files.
Here's what I have tried:
From views.py:
def predict_encoding(file_path, n_lines=60):
    '''Predict a file's encoding using chardet'''
    import chardet
    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.read() for _ in range(n_lines)])
    encoding = chardet.detect(rawdata)['encoding']
    print('Default encoding is: {}'.format(encoding))
    if encoding is None:
        rawdata.decode('utf8').encode('ascii', 'ignore')
        print('updated decoding is: {}'.format(chardet.detect(rawdata)['encoding']))
    return chardet.detect(rawdata)['encoding']

encoding = predict_encoding(text_path)
txt = Path(text_path).read_text(encoding=encoding)
However, for some files (see the example file below), it raises an error like:
/Users/abdul/Downloads/to_save/cert2.txt
Default encoding is: None
updated decoding is: None
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 339: character maps to <undefined>
Here's the example file that triggers this error: https://textuploader.com/d8ec5
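
For reference, here is a minimal sketch of what appears to be happening. It assumes that chardet returns None for the file and that read_text(encoding=None) then falls back to a charmap-based locale encoding such as cp1252 (an assumption; the actual locale encoding on my machine is not shown in the output above). The helper name read_text_with_fallback and the choice of UTF-8 with errors='replace' as the fallback are hypothetical, for illustration only, not a confirmed fix:

```python
from pathlib import Path
import tempfile


def read_text_with_fallback(path, encoding=None, fallback='utf-8'):
    """Read text; if no encoding was detected, decode with a fallback
    codec and replace undecodable bytes instead of raising."""
    if encoding is None:
        # Hypothetical fallback: undecodable bytes become U+FFFD.
        return Path(path).read_text(encoding=fallback, errors='replace')
    return Path(path).read_text(encoding=encoding)


# Reproduce the error: byte 0x81 has no mapping in Python's cp1252 codec.
sample = b'certificate \x81 data'
try:
    sample.decode('cp1252')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Write the sample bytes to a temporary file and read them back safely.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.txt'
    sample_path.write_bytes(sample)
    text = read_text_with_fallback(sample_path, encoding=None)
```

With this sketch the undecodable byte is replaced rather than raising, which loses information, so it may only be acceptable if the analysis can tolerate a few replacement characters.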