
I'm working on a project in Python (3.6) where I need to read text files from a directory that can contain thousands of files, perform some analysis on them, and upload the results to Google Cloud Storage. Encoding errors occur when reading some of the files.

Here's what I have tried:

from views.py:

from pathlib import Path


def predict_encoding(file_path, n_lines=60):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join the first n_lines lines; readline() reads one line per call,
        # whereas read() would consume the whole file on the first call
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    encoding = chardet.detect(rawdata)['encoding']
    print('Default encoding is: {}'.format(encoding))
    if encoding is None:
        # Note: the result of this expression is discarded, so rawdata is unchanged
        rawdata.decode('utf8').encode('ascii', 'ignore')
        print('updated decoding is: {}'.format(chardet.detect(rawdata)['encoding']))
    return chardet.detect(rawdata)['encoding']


encoding = predict_encoding(text_path)
txt = Path(text_path).read_text(encoding=encoding)

But for some files (see the example file below) it raises an error like:

/Users/abdul/Downloads/to_save/cert2.txt

Default encoding is: None

updated decoding is: None

return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 339: character maps to <undefined>

Here's an example file that triggers this error: https://textuploader.com/d8ec5

Abdul Rehman

1 Answer


The file you are trying to analyze appears to be an image rather than text (its header contains "Compress (tm) Xing Technology Corp"). So before detecting the encoding, you need to check whether the file is binary. You can use the following check for that:

>>> is_binary_string(open(text_path, 'rb').read(1024))
True
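The snippet above calls `is_binary_string` without showing its definition. A minimal sketch of one widely used heuristic follows, assuming that "text" means a handful of control characters plus bytes 0x20 to 0xFF except DEL. The constant name `TEXT_CHARS` and the exact byte set are assumptions for illustration, not something stated in this answer.

```python
# Bytes treated as "text": BEL, BS, TAB, LF, FF, CR, ESC, plus 0x20-0xFF except DEL.
# This set is an assumed heuristic, not an exact definition of text.
TEXT_CHARS = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7f})


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains at least one byte outside TEXT_CHARS."""
    # bytes.translate(None, delete) drops every byte listed in delete,
    # so anything left over is a non-text byte.
    return bool(data.translate(None, TEXT_CHARS))
```

The argument must be `bytes`, so the file has to be opened in `'rb'` mode, for example `with open(text_path, 'rb') as f: is_binary_string(f.read(1024))`. Because only a sample is inspected, the check may misclassify files whose first bytes are unrepresentative.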
Alderven
  • It works in the Python shell, but in the Django file it raises an error: `TypeError: translate() takes exactly one argument (2 given)` – Abdul Rehman Dec 28 '18 at 05:13
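A `TypeError` of this kind is what Python 3 raises when a two-argument `translate()` is called on a `str` rather than on `bytes`. One possible cause, which is only an assumption here because the Django code is not shown, is that the data reached the check as text (for instance from a file opened without `'rb'`). The helper name `strip_bytes` below is hypothetical and only illustrates the difference:

```python
def strip_bytes(data, delete):
    """Remove every byte in delete from data using the two-argument translate()."""
    # bytes.translate accepts (table, delete); str.translate accepts only (table).
    return data.translate(None, delete)


# Works on bytes: the listed bytes are removed.
print(strip_bytes(b'abcd', b'abc'))

# Fails on str: translate() takes a single argument there.
try:
    strip_bytes('abcd', 'abc')
except TypeError as exc:
    print('TypeError:', exc)
```

If this is the cause, reading the file in binary mode before running the check should avoid the error.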