
I have a large set of files (500+) to analyze. Most are already plain text, but at least 50 have turned out to be Unicode-encoded. Yes, I could manually open and re-save each one as plain text, but this is something I will be doing with at least another 10 data sets. If it helps, I'm writing in Jupyter notebooks.

The current code just detects whether a file is a Unicode file. If it is, I'm going to add it to a list and convert those files; if it is not, I just want to print a statement that it isn't a Unicode file.

import os, os.path
import codecs

basepath = r"FULL/PATH/TO/FILE"

os.chmod(r"FULL/PATH/TO/FILE", 777)

for root, dirs, files in os.walk(basepath, topdown=False):
    for name in files:
        try:
            data = codecs.open(name, encoding='utf-8', errors='strict')
            for line in data:
                # write to add those files to a list to be converted and
                # rewritten to a folder, deleting other files

                pass
            print (data + "is valid UTF")
        except UnicdeDecodeError:
            print (data + "invalid UTF")

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-1bae56973fdf> in <module>
     12         try:
---> 13             data = codecs.open(name, encoding='utf-8', errors='strict')
     14             for line in data:

~\AppData\Local\Continuum\anaconda3\lib\codecs.py in open(filename, mode, encoding, errors, buffering)
    896         mode = mode + 'b'
--> 897     file = builtins.open(filename, mode, buffering)
    898     if encoding is None:

FileNotFoundError: [Errno 2] No such file or directory: 'plaintext1.txt'

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
<ipython-input-1-1bae56973fdf> in <module>
     17                 pass
     18             print (data + "is valid UTF")
---> 19         except UnicdeDecodeError:
     20             print (data + "invalid UTF")
     21 

NameError: name 'UnicdeDecodeError' is not defined
    Be aware that converting from Unicode (likely UTF-8) to ASCII ("plain text") can result in data loss if non-ASCII characters are used to store data in your files. – ihatecsv Jul 23 '19 at 13:30
  • @ihatecsv Is there a precaution I can use? Is there a different language that has a tool for this? I'm willing to write it in IDL or C++. – Pandasncode Jul 23 '19 at 13:33
  • No, it's impossible for ASCII files to store characters other than what's in the ASCII character set. If you have non-ASCII characters before conversion, no matter the programming language, they will be lost. You can encode the non-ASCII characters as Unicode escape sequences, but they will no longer "look" like a character without explicitly decoding them: https://stackoverflow.com/a/19527434/3894173 – ihatecsv Jul 23 '19 at 13:42
  • I think you are misusing the term "plain text" here. What do you really mean? Also Unicode is a character set. I think you meant specifically one of its character encodings: UTF-8. Are you sure that all the files aren't encoded with UTF-8? (Only the writer can really tell you.) – Tom Blodget Jul 23 '19 at 16:56
  • @TomBlodget When you save one of these files manually, as .txt, and then run the analysis script over it again, it works. But if you don't, it spits out Unicode text and doesn't work for the analysis. I'm positive the others aren't, as this issue hasn't been encountered until now and the analysis over them hasn't yielded these results before. This process has been going on for a long time. – Pandasncode Jul 23 '19 at 19:20
  • Okay, let's try this from the other direction. Which character encoding is the downstream process expecting? – Tom Blodget Jul 23 '19 at 19:48
  • Just a .txt file. Nothing more. – Pandasncode Jul 24 '19 at 12:42
  • There is no text but encoded text. A text file has to be read with the character encoding it was written with. So, both the bytes and a mutual agreement on the character encoding are required. – Tom Blodget Jul 25 '19 at 04:06
  • Okay. So I'm confused as to how this hasn't been an issue until now, I guess. All of a sudden the program is spitting back Unicode characters for only some files, and the solution is to open and re-save the original files before running them back through the program. There has to be an easier way than opening every individual file. – Pandasncode Jul 25 '19 at 15:07
  • The error is the result of a typo: `UnicdeDecodeError` is missing the "o" in "Unicode". – snakecharmerb Aug 16 '20 at 12:28
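
Putting the comments and the traceback together, the two errors in the question are independent of the encoding problem: `os.walk` yields bare file names, so each name must be joined with `root` before opening, and the exception name is misspelled (`UnicdeDecodeError` → `UnicodeDecodeError`). (The `os.chmod(..., 777)` call also passes decimal 777 rather than octal `0o777`, but that is unrelated to the traceback.) A minimal corrected sketch of the detection loop, assuming the goal stated in the question (the `convert_list` name is illustrative):

import os

basepath = r"FULL/PATH/TO/FILE"
convert_list = []  # files that decode as UTF-8 and still need converting

for root, dirs, files in os.walk(basepath, topdown=False):
    for name in files:
        path = os.path.join(root, name)  # os.walk yields bare names
        try:
            # Strict decoding raises UnicodeDecodeError on any invalid byte
            with open(path, encoding='utf-8', errors='strict') as data:
                data.read()
            convert_list.append(path)
            print(path + " is valid UTF-8")
        except UnicodeDecodeError:
            print(path + " is not valid UTF-8")

Note that pure ASCII files also pass this check, since ASCII is a subset of UTF-8; the check only rejects files containing byte sequences that are invalid UTF-8 (e.g. UTF-16 or Latin-1 data).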

1 Answer


You can use the `magic` library together with `glob`:

import magic
import glob

# libmagic detector for character encodings
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()

# Read *.txt in the current directory
for x in glob.iglob("*.txt"):
    with open(x, "rb") as source:
        blob = source.read()
    # Detect the encoding of the file (libmagic reports lowercase names)
    encoding = m.buffer(blob)
    if encoding == "utf-8":
        pass  # already UTF-8, don't do anything
    elif encoding == "us-ascii":
        # Convert to UTF-8 (ASCII is a subset, so decoding cannot fail)
        with open(x, "w", encoding="utf-8") as target:
            target.write(blob.decode("ascii"))
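
Note that this uses the Python bindings shipped with libmagic/`file` (`magic.open`, `m.load()`, `m.buffer()`). If you have the `python-magic` package from PyPI instead, the equivalent calls are `magic.Magic(mime_encoding=True)` and `.from_buffer(blob)`. Either way, an ASCII file is already valid UTF-8 byte for byte, so the rewrite above only matters if you need the file re-saved explicitly.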

Trevor
  • Thanks for the idea. Your code is converting ASCII to UTF-8, correct? I'm going the other way, but the idea helped! – Pandasncode Jul 23 '19 at 14:59
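
Going the other way (UTF-8 in, "plain text" ASCII out), a minimal sketch; per ihatecsv's warning above, non-ASCII characters cannot survive as-is, so the `errors='backslashreplace'` choice here (an assumption, not from the original answer) turns them into escape sequences rather than raising `UnicodeEncodeError`:

# Re-save a UTF-8 file as ASCII; non-ASCII characters become
# backslash escapes instead of aborting the conversion.
with open("input.txt", encoding="utf-8") as source:
    text = source.read()
with open("input.txt", "w", encoding="ascii",
          errors="backslashreplace") as target:
    target.write(text)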