I have a large set of files (500+) to analyze. Most are already plain text, but I have confirmed that 50+ are Unicode. Yes, I could manually open each one and save it as plain text, but I will be doing this with at least another 10 data sets. If it helps, I'm writing the code in a Jupyter notebook.
The current code only tries to detect whether a file is Unicode. If it is, I plan to add the file to a list and convert the files on that list. If it is not, I just want to print a message saying it isn't a Unicode file.
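For the conversion step described above, a minimal sketch might look like the following. It rests on an assumption the post does not confirm: that the "Unicode" files are UTF-16 or UTF-8 files starting with a byte-order mark (BOM), as is common for files saved as "Unicode" on Windows. The names `sniff_bom_encoding` and `convert_to_utf8` are hypothetical helpers introduced only for illustration, and the converted copies go to a separate folder so the originals stay untouched.

```python
import codecs
from pathlib import Path

# Byte-order marks checked in this sketch, mapped to a codec that strips them.
# Assumption: the "Unicode" files start with one of these BOMs.
# UTF-32 is deliberately not handled here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(raw):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return None


def convert_to_utf8(src, dst_dir):
    """Write a BOM-free UTF-8 copy of src into dst_dir and return its path."""
    src = Path(src)
    raw = src.read_bytes()
    # Fall back to plain UTF-8 when no BOM is found.
    encoding = sniff_bom_encoding(raw) or "utf-8"
    # Raises UnicodeDecodeError if the guessed encoding is wrong.
    text = raw.decode(encoding)
    dst = Path(dst_dir) / src.name
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Writing bytes avoids any newline translation on Windows.
    dst.write_bytes(text.encode("utf-8"))
    return dst
```

Calling `convert_to_utf8(path, "converted")` for each path on the list would leave a UTF-8 copy in a `converted` folder (a placeholder name). Files in an encoding without a BOM would need a different detection approach.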
import os, os.path
import codecs

basepath = r"FULL/PATH/TO/FILE"
os.chmod(r"FULL/PATH/TO/FILE", 777)

for root, dirs, files in os.walk(basepath, topdown=False):
    for name in files:
        try:
            data = codecs.open(name, encoding='utf-8', errors='strict')
            for line in data:
                #write to add those files to a list to be converted and
                #rewritten to a folder deleting other files
                pass
            print (data + "is valid UTF")
        except UnicdeDecodeError:
            print (data + "invalid UTF")
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-1bae56973fdf> in <module>
     12         try:
---> 13             data = codecs.open(name, encoding='utf-8', errors='strict')
     14             for line in data:

~\AppData\Local\Continuum\anaconda3\lib\codecs.py in open(filename, mode, encoding, errors, buffering)
    896         mode = mode + 'b'
--> 897     file = builtins.open(filename, mode, buffering)
    898     if encoding is None:

FileNotFoundError: [Errno 2] No such file or directory: 'plaintext1.txt'

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
<ipython-input-1-1bae56973fdf> in <module>
     17                 pass
     18             print (data + "is valid UTF")
---> 19         except UnicdeDecodeError:
     20             print (data + "invalid UTF")
     21

NameError: name 'UnicdeDecodeError' is not defined
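Both errors in the traceback have identifiable causes. `os.walk` yields bare file names, so `name` has to be joined with `root` before it can be opened; otherwise the lookup happens in the notebook's working directory, which gives the `FileNotFoundError`. The `except` clause misspells `UnicodeDecodeError`, and Python only evaluates that name once an exception reaches the clause, which gives the `NameError`. Two further issues would surface later: `print(data + "...")` tries to concatenate a file object with a string, and `os.chmod(..., 777)` passes decimal 777 rather than octal `0o777`. A minimal sketch of the detection step with these points addressed, where `split_by_utf8_validity` is a hypothetical helper name and file permissions are left alone:

```python
import os


def split_by_utf8_validity(basepath):
    """Walk basepath and return (valid, invalid) lists of file paths.

    A file counts as valid here if all of its bytes decode as strict UTF-8.
    Plain ASCII files are valid UTF-8, so they land in the first list.
    """
    valid, invalid = [], []
    for root, dirs, files in os.walk(basepath):
        for name in files:
            # os.walk yields bare names, so build the full path first.
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8", errors="strict") as handle:
                    for _ in handle:  # iterating forces the decode
                        pass
            except UnicodeDecodeError:
                invalid.append(path)
            else:
                valid.append(path)
    return valid, invalid
```

Calling `valid, invalid = split_by_utf8_validity(basepath)` and then printing each entry of `invalid` gives the list of files to convert. One caveat: a file that fails this check is not necessarily "Unicode" in any particular encoding; it is only known not to be valid UTF-8.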