
I'm trying to run this code:

import glob
import io

read_files = filter(lambda f: f!='final.txt' and f!='result.txt', glob.glob('*.txt'))


with io.open("REGEXES.rx.txt", "w", encoding='UTF-32') as outfile:
    for f in read_files:
        with open(f, "r") as infile:
            outfile.write(infile.read())
            outfile.write('|')

to combine some text files, but I get this error:

Traceback (most recent call last):
  File "/Users/kosay.jabre/Desktop/Password Assessor/RegexesNEW/CombineFilesCopy.py", line 10, in <module>
    outfile.write(infile.read())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 2189: ordinal not in range(128)

I've tried UTF-8, UTF-16, UTF-32 and latin-1 encodings. Any ideas?

  • Are you sure that the UnicodeDecodeError has anything to do with your output file encoding? Looks to me like it's caused by the `infile.read()` part of your code. Perhaps you should try `infile.read().decode("utf-8")` – pholtz Mar 14 '16 at 19:13
  • Getting the same error when I do that –  Mar 14 '16 at 19:23
  • Ok. Some questions to help better understand the issue: How many files are in `read_files`? What is the character encoding of each file? Are all of the encodings the same? If you try to read from each of this unknown number of files expecting a specific encoding and even one of them has a different encoding than what you're expecting, it's probably going to break. Sure, if they're all UTF-8 with only simple characters you might be fine, but clearly that's not the case here. – pholtz Mar 14 '16 at 20:12
  • @Kos. What @pholtz said. And could you also show the output of `print(sys.getdefaultencoding())`? – ekhumoro Mar 14 '16 at 20:18
  • There are ~50 files in read_files. I'm not sure of the character encoding of each file but I know there are unicode characters inside. All the encodings should be the same. print(sys.getdefaultencoding()) returns "utf-8". –  Mar 14 '16 at 20:29

1 Answer


You're getting the error from `infile.read()`. The file was opened in text mode without an encoding specified, so Python falls back to the platform's default encoding, which on your system is ASCII. Any byte greater than \x7f / 127 is not valid ASCII, so reading it throws an error.

You need to know the encoding of your files before you proceed, otherwise you will get errors when Python tries to decode one encoding as another, or you will silently get mojibake.
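If you don't know what the files contain, one option is to sniff them first. This is just a sketch, assuming the third-party `chardet` package is installed (`pip install chardet`); its guess is a heuristic hint rather than a guarantee:

import glob
import chardet

for name in glob.glob('*.txt'):
    with open(name, 'rb') as f:        # read raw bytes; no decoding happens here
        raw = f.read(10000)            # a sample is usually enough for a guess
    result = chardet.detect(raw)
    print(name, result['encoding'], result['confidence'])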

Assuming that infile will be utf-8 encoded, change:

with open(f, "r") as infile:

to:

with open(f, "r", encoding="utf-8") as infile:

You may also want to change outfile's encoding to UTF-8 to avoid wasting space (UTF-32 uses four bytes per character). Because the input is decoded to plain Unicode strings, the infile and outfile encodings don't need to match.
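Putting it together, a minimal sketch of the whole script with explicit encodings might look like the following. It assumes UTF-8 input files; swap in whatever encoding your files actually use (latin-1 ended up working in the comments below):

import glob

INPUT_ENCODING = 'utf-8'   # change to the real encoding of your source files

read_files = [f for f in glob.glob('*.txt') if f not in ('final.txt', 'result.txt')]

# Decode each input with INPUT_ENCODING and re-encode the combined output as UTF-8.
with open('REGEXES.rx.txt', 'w', encoding='utf-8') as outfile:
    for name in read_files:
        with open(name, 'r', encoding=INPUT_ENCODING) as infile:
            outfile.write(infile.read())
            outfile.write('|')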

Alastair McCormack
  • Hi, `with io.open("REGEXES.rx.txt", "w", encoding='utf-8') as outfile: for f in read_files: with open(f, "r", encoding = 'utf-8') as infile: outfile.write(infile.read()) outfile.write('|')` returns the same error –  Mar 14 '16 at 20:31
  • Yes `Traceback (most recent call last): File "/Users/kosay.jabre/Desktop/Password Assessor/RegexesNEW/CombineFilesCopy.py", line 14, in outfile.write(infile.read()) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 2189: invalid start byte` –  Mar 14 '16 at 20:33
  • So not the same error. Your input files are not "utf-8". You should ideally know what encoding they are. If not, try setting `encoding="latin1"` on the `infile`. – Alastair McCormack Mar 14 '16 at 20:35
  • I had tried latin-1 on the outfile, but forgot to do it on the infile! This works. Thanks very much –  Mar 14 '16 at 20:38
  • Good stuff! See my update for clarifications on a few things. – Alastair McCormack Mar 14 '16 at 20:40
  • @Kos, one thing to note, `latin1` encoding will "work" on any file, but if the files aren't all latin1-encoded you will get mojibake. Some byte sequences are illegal in UTF-8 which is why non-UTF-8 files give errors, but latin1 accepts anything. – Mark Tolonen Mar 16 '16 at 02:21
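To see why latin-1 never complains while UTF-8 does, here is a tiny sketch using the 0xa3 byte from the traceback; latin-1 maps every possible byte to a character, so decoding always "succeeds", correctly or not:

data = b'\xa3'

print(data.decode('latin-1'))      # '£' - latin-1 accepts any byte
try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)                     # 'utf-8' codec can't decode byte 0xa3 ... invalid start byte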