
I have written some code that opens all the text files in a folder, removes certain words, and then writes the content to a new text file. The files I'm working with were created on a Windows machine, saved as UTF-8, and then downloaded to a Mac (which may be part of the problem). The code works for 66 of the 250 files and then breaks. I'm getting the following error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-51-7c4734f2a95f> in <module>
      1 for file in os.listdir(path):
      2         with open(file, 'r', encoding='utf-8') as f:
----> 3             flines = f.readlines()
      4             new_content = []
      5             for line in flines:

~/anaconda/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 287: invalid start byte

I've checked the encoding of a few of the files that aren't being converted by running `file -I {filename}` in the terminal, and it does report charset=utf-8. However, I still think the problem must be the encoding.

I've tried `encoding='ascii'` and opening with `'rb'` instead of `'r'`, but with no success. I think this page could help me, but I can't work out how to incorporate it into my code: https://www.programiz.com/python-programming/methods/string/encode.

Any help would be really appreciated.

import os

# words treated as fillers and stripped from the transcripts
fillers = ['um', 'uum', 'umm', 'er', 'eer', 'uh', 'ah', 'ahh', 'hm', 'hmm', 'mm',
           'Um', 'Uum', 'Umm', 'Er', 'Eer', 'Uh', 'Ah', 'Ahh', 'Hm', 'Hmm', 'Mm']

for file in os.listdir(path):
    # assumes the working directory is `path`
    with open(file, 'r', encoding='utf-8') as f:
        flines = f.readlines()
        new_content = []
        for line in flines:
            content = line.split()

            new_content_line = []
            new_content_line2 = []

            # drop annotation tokens
            for word in content:
                if (not word.startswith('[=') and not word.startswith('#')
                        and not word.startswith('..') and not word.endswith(']')
                        and not word.endswith('=')):
                    new_content_line.append(word)

            # drop filler words
            for word in new_content_line:
                if word not in fillers:
                    new_content_line2.append(word)

            # lower-case everything and drop immediately repeated words
            new_content_line2 = [x.lower() for x in new_content_line2]
            for v, w in zip(new_content_line2[:-1], new_content_line2[1:]):
                if v == w:
                    new_content_line2.remove(v)

            new_content.append(' '.join(new_content_line2))

    f2 = open(file.rsplit(".", 1)[0] + "_processed.txt", "w", encoding='utf-8')
    f2.write('\n'.join(new_content))
    f.close()
    f2.close()
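
For reference, the error handlers described on the page linked above ('ignore', 'replace', 'backslashreplace') can also be passed straight to `open()` via its `errors` parameter, and byte 0x92 happens to be the curly apostrophe in the Windows-1252 code page, so the failing files may actually be cp1252 rather than UTF-8 (that encoding is an assumption, not something confirmed for these files). A minimal sketch of a reader that tries the likely encodings first and only then replaces undecodable bytes (the `read_lines` helper is purely illustrative):

def read_lines(filename):
    # Sketch only: try strict UTF-8 first, then cp1252 (assumed), and as a last
    # resort decode as UTF-8 while substituting U+FFFD for undecodable bytes.
    for enc in ('utf-8', 'cp1252'):
        try:
            with open(filename, 'r', encoding=enc) as f:
                return f.readlines()
        except UnicodeDecodeError:
            pass
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        return f.readlines()

Each file could then be read with `flines = read_lines(file)` in place of the `f.readlines()` call above.
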
  • Are you having trouble detecting the files, or are you just having issues with some files containing characters/bytes that are throwing off your code? Your title might be misleading. Also, a small note: since you're familiar with the `with open() as f:` construct, you can use that for `f2` as well. Once the code block is executed, it calls `.close()` for you, so you can remove those lines. – Cohan May 03 '19 at 14:43
  • `file` only applies some heuristics to determine the type; it doesn't actually try to decode the entire file. It may *look* like the file contains UTF-8-encoded text, but there is apparently an issue somewhere, since the *actual* decoding is failing. – chepner May 03 '19 at 14:46
  • Here is another [answer](https://stackoverflow.com/a/51763708/3545273) of mine that explains how to process files when you are unsure of the encoding. That one was targeted at pandas `read_csv`, but the relevant parameters are the same for the `open` function. – Serge Ballesta May 03 '19 at 15:10
  • @BrianCohan Yes, you're right, it's not a detection problem; I have amended my title. Thanks for the tip about closing files, I'll change that! – firefly May 03 '19 at 15:34
  • @SergeBallesta That worked, thank you so much! I added `file_encoding = 'utf8'` and `with open(file, encoding=file_encoding, errors='ignore') as f:`. I did try `'backslashreplace'` but got some weird characters in the new file. – firefly May 03 '19 at 15:35
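
As chepner's comment points out, `file` only samples the contents, so a quick way to see which files genuinely fail to decode, and at which byte, is to attempt the decode yourself. A small sketch, assuming `path` is the same folder used in the question:

import os

for name in os.listdir(path):
    with open(os.path.join(path, name), 'rb') as f:
        raw = f.read()
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as e:
        # e.g. "'utf-8' codec can't decode byte 0x92 in position 287: ..."
        print(name, e)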

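And a sketch of what the fix described in the last comment looks like when folded back into the loop: `errors='ignore'` silently drops the bytes that are not valid UTF-8 (`'backslashreplace'` would instead keep them as literal \x.. escapes, which explains the weird characters), and both files are managed with `with` blocks as Cohan suggested. The word-filtering step is elided here:

file_encoding = 'utf8'

for file in os.listdir(path):
    out_name = file.rsplit(".", 1)[0] + "_processed.txt"
    # errors='ignore' drops any byte sequence that is not valid UTF-8
    with open(file, 'r', encoding=file_encoding, errors='ignore') as f, \
         open(out_name, 'w', encoding='utf-8') as f2:
        new_content = []
        for line in f:
            # ... same per-line word filtering as in the question ...
            new_content.append(line.strip())
        f2.write('\n'.join(new_content))
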
0 Answers