I'm having difficulty getting the character encoding of a file. The offending code is here:
rawdata = open(file, "r").read()
encoding = chardet.detect(rawdata.encode())['encoding']
#return encoding
(Code courtesy of Ashish Greycube: https://github.com/frappe/frappe/pull/8061)
I've copied a segment of the CSV file I'm working on into a smaller, more manageable 'test' file. When I run the above code on that test file, it reports the encoding as 'ascii'. That might be part of the problem. Basically, I've found that I need to know the file's encoding for this program to work.
The error report is as follows:
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 20, in get_file_encoding
    encoding = chardet.detect(rawdata.encode())['encoding']
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\__init__.py", line 38, in detect
    detector.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\universaldetector.py", line 211, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetgroupprober.py", line 71, in feed
    state = prober.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\hebrewprober.py", line 227, in feed
    byte_str = self.filter_high_byte_only(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetprober.py", line 63, in filter_high_byte_only
    buf = re.sub(b'([\x00-\x7F])+', b' ', buf)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 19, in get_file_encoding
    rawdata = open(file, "r").read()
FileNotFoundError: [Errno 2] No such file or directory: 'ANQAR.csv'
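For reference, here is a minimal sketch of one way the detection step might be restructured. It assumes the MemoryError comes from holding the whole file in memory several times over (once as text, once more from .encode(), and again inside chardet's re.sub), which the Python38-32 path suggests is happening on a 32-bit interpreter. That is a guess, not something the traceback proves. The sketch opens the file in binary mode and feeds chardet's UniversalDetector in fixed-size chunks instead of one large buffer. The chunk_size parameter and its default are my own illustrative choices, not taken from the original code.

```python
from chardet.universaldetector import UniversalDetector


def get_file_encoding(path, chunk_size=64 * 1024):
    """Guess a file's encoding by feeding chardet fixed-size binary chunks.

    chunk_size is an illustrative default, not a value from the original code.
    """
    detector = UniversalDetector()
    # Binary mode: chardet works on raw bytes, so there is no need to
    # decode the file as text and then call .encode() on it again.
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            # chardet sets .done once it is confident; stop reading early.
            if detector.done:
                break
    # close() finalises the guess and fills in detector.result.
    detector.close()
    return detector.result["encoding"]
```

It would be called with the full filename, for example get_file_encoding("ANQAR.csv"). The result can still be None if chardet cannot make a guess, so the caller may want a fallback encoding.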