Python - get encoding of file

Question

I'm having difficulty getting the character encoding of a file. The offending code is here:

    rawdata = open(file, "r").read()
    encoding = chardet.detect(rawdata.encode())['encoding']
    #return encoding

(Code courtesy of Ashish Greycube: https://github.com/frappe/frappe/pull/8061)

I've copied a segment of the CSV file I'm working on into a smaller, more manageable test file. When I run the above code on it, the reported encoding is 'ascii'. That might be part of the problem. Basically, I've found out that I need to know the file's encoding for this program.

The error report is as follows:

Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 20, in get_file_encoding
    encoding = chardet.detect(rawdata.encode())['encoding']
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\__init__.py", line 38, in detect
    detector.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\universaldetector.py", line 211, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetgroupprober.py", line 71, in feed
    state = prober.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\hebrewprober.py", line 227, in feed
    byte_str = self.filter_high_byte_only(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetprober.py", line 63, in filter_high_byte_only
    buf = re.sub(b'([\x00-\x7F])+', b' ', buf)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 19, in get_file_encoding
    rawdata = open(file, "r").read()
FileNotFoundError: [Errno 2] No such file or directory: 'ANQAR.csv'
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 20, in get_file_encoding
    encoding = chardet.detect(rawdata.encode())['encoding']
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\__init__.py", line 38, in detect
    detector.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\universaldetector.py", line 211, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetgroupprober.py", line 71, in feed
    state = prober.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\hebrewprober.py", line 227, in feed
    byte_str = self.filter_high_byte_only(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetprober.py", line 63, in filter_high_byte_only
    buf = re.sub(b'([\x00-\x7F])+', b' ', buf)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError

The exception says the file isn't present. Are you sure the file exists and that you haven't mistyped its name? – Tarique, Mar 06 '20 at 04:36

score 1 · Answer 1 · answered Mar 06 '20 at 04:36

A MemoryError usually means you're trying to load more data than fits in memory, whether the limit is address space or available storage (RAM plus swap/page file space). You seem to be running a 32-bit build of Python (the Python38-32 directory in your traceback suggests as much), which would limit you to 2 GB of address space. I'd suggest switching to a 64-bit build: most machines nowadays have more than 4 GB of RAM, and a 32-bit build can't use most of it.
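To confirm which build is actually running before reinstalling anything, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name interpreter_bits is made up for illustration.

```python
import struct
import sys


def interpreter_bits() -> int:
    """Return the pointer width of the running interpreter in bits."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"{interpreter_bits()}-bit Python {sys.version.split()[0]}")
    # sys.maxsize is 2**31 - 1 on a 32-bit build and 2**63 - 1 on a 64-bit one.
    print("64-bit build:", sys.maxsize > 2**32)
```

If this reports a 32-bit interpreter on a 64-bit machine, installing the 64-bit build is the change suggested above.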

Additional issue: When you read the file in text mode, you're already assuming you know the encoding. Don't do that. Open it in binary mode ("rb") to get the raw, unmodified bytes, so chardet gets them directly before you try decoding them in a possibly incorrect encoding.
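One way to follow this advice without holding the whole file in memory is to read it in binary mode a chunk at a time and feed each chunk to an incremental detector. The sketch below assumes a detector object with the feed() / close() / done / result interface that chardet's UniversalDetector (in chardet.universaldetector) provides. The function name detect_file_encoding and the 64 KiB chunk size are illustrative choices, and whether this avoids the MemoryError on your particular file is untested here.

```python
def detect_file_encoding(path, detector, chunk_size=64 * 1024):
    """Feed a file's raw bytes to an incremental detector, chunk by chunk.

    `detector` is assumed to expose feed(bytes), close(), a boolean `done`
    attribute and a `result` dict, as chardet's UniversalDetector does.
    """
    # "rb" yields the bytes exactly as stored on disk; nothing is decoded.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # end of file
                break
            detector.feed(chunk)
            if detector.done:      # detector is already confident; stop early
                break
    detector.close()               # let the detector finalise its guess
    return detector.result.get("encoding")
```

With chardet installed, this would be called as detect_file_encoding(file, UniversalDetector()) after "from chardet.universaldetector import UniversalDetector". Because the detector may stop early, the result reflects only the bytes it has seen.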

Using rawdata = open(file, "rb").read() followed by encoding = chardet.detect(rawdata)['encoding'], I get the encoding as Windows-1252. This is basically just gibberish text, not 'utf-8' or something similar that I was hoping for... – m.ravetch, Mar 06 '20 at 21:47

score 1 · Answer 2 · answered Mar 06 '20 at 23:43

This works:

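    import chardet

    # file is the path to the CSV. Binary mode ("rb") returns the raw bytes,
    # so nothing is decoded before chardet inspects them.
    with open(file, "rb") as f:
        rawdata = f.read()
    encoding = chardet.detect(rawdata)['encoding']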
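Since chardet's result is a guess with an associated confidence rather than a certainty, it can be worth checking a specific candidate directly. The sketch below uses only the standard library to test whether a file's bytes decode as strict UTF-8, reading in chunks so the whole file is never in memory at once. The function name is_valid_utf8 and the chunk size are illustrative and not part of chardet.

```python
import codecs


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Return True if the file's bytes form valid UTF-8, False otherwise."""
    # An incremental decoder copes with multi-byte characters that happen
    # to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

A pure-ASCII file also passes, because ASCII is a subset of UTF-8. A False result means the file is definitely not UTF-8, in which case a single-byte guess such as Windows-1252 becomes more plausible.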

m.ravetch

score -1 · Answer 3 · answered Mar 06 '20 at 04:43

As @ShadowRanger said, try a 64-bit build of Python, and don't read the file in text mode. Try this:

enter co rawdata = open(file, "rb").read()
encoding = chardet.detect(rawdata)['encoding']

Also make sure the file exists and that you've typed its name correctly.

Logitech Flames

I'm not sure what 'enter co' will do, but it returns an error. – m.ravetch, Mar 06 '20 at 22:37
