Combining multiple txt files (Python 3, UnicodeDecodeError)

Question

The code below worked fine in Python 2 to combine all the txt files in a folder:

import os

base_folder = "C:\\FDD\\"

all_files = []

for each in os.listdir(base_folder):
    if each.endswith('.txt'):
        kk = os.path.join(base_folder, each)
        all_files.append(kk)

with open(base_folder + "Combined.txt", 'w') as outfile:
    for fname in all_files:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

In Python 3, it raises an error:

Traceback (most recent call last):
  File "C:\Scripts\thescript.py", line 26, in <module>
    for line in infile:
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0xe4 in position 53: No mapping for the Unicode character exists in the target code page.

I made this change:

with open(fname) as infile:

to

with open(fname, 'r', encoding = 'latin-1') as infile:

That gives me a "MemoryError" instead.

How can I correct this error in Python 3? Thank you.

There is a 2to3.py tool. Have you looked into that? Link: https://docs.python.org/2/library/2to3.html — tafaust, Oct 15 '19 at 06:26
Always put the full error message (starting at the word "Traceback") in the question (not in a comment) as text (not as a screenshot). It contains other useful information. — furas, Oct 15 '19 at 06:32
@ThomasHesse, thank you. I just tried it. It says "RefactoringTool: No changes to thescript.py" — Mark K, Oct 15 '19 at 06:36
try reading only one file (or a smaller number of files) and see if it still gives you a memory error. If this works, then it is really a memory error, and can be solved quickly — Massifox, Oct 15 '19 at 06:46
@Massifox, the generated output file reaches 374 KB, with contents, before the error appears, so it seems to be writing. — Mark K, Oct 15 '19 at 06:49
@henrywongkk, your way works! can you post as an answer, so we can close this question? — Mark K, Oct 15 '19 at 07:05
`CP_UTF8` is an alias for `cp65001`, a Microsoft implementation which differed slightly from the standard UTF-8. In Python 3, try specifying `encoding='utf-8'` in your `open` statements. The implementations no longer differ, so in Python 3.8 `cp65001` is an alias for `UTF-8`. — snakecharmerb, Oct 15 '19 at 07:07

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

As @transilvlad suggested here, use the open function from the codecs module to read the file:

import codecs

# fname and outfile come from the loops in the question
with codecs.open(fname, 'r', encoding='utf-8', errors='ignore') as infile:
    for line in infile:
        outfile.write(line)

This ignores the bytes that cannot be decoded and returns the string without them.
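
To see what errors='ignore' actually does to the data, here is a small sketch using the byte 0xe4 from the traceback. The sample bytes are made up for illustration; the real files may contain something else. It also shows errors='replace' and a latin-1 decode for comparison, since ignoring silently drops characters.

```python
# Hypothetical sample: 0xe4 is "ä" in Latin-1, but in UTF-8 it is the lead
# byte of a multi-byte sequence, so on its own it cannot be decoded.
raw = b"M\xe4rz 2019\n"

# Strict decoding (the default) fails, as in the question's traceback.
try:
    raw.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "failed"

# errors="ignore" drops the undecodable byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where the byte was.
replaced = raw.decode("utf-8", errors="replace")

# Latin-1 maps every byte to a character, so it never raises.
latin1 = raw.decode("latin-1")

print(strict_result, repr(ignored), repr(replaced), repr(latin1))
```

If the files are really Latin-1 (or cp1252) text, decoding them as such and writing the output as UTF-8 would keep the characters instead of discarding them; that depends on what the files contain, which the question does not say.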

answered Oct 15 '19 at 07:11

henrywongkk

Solution: the "with open(fname) as infile:" in the question, changed to "with codecs.open(fname, 'r', encoding = 'utf-8', errors='ignore') as infile:" – Mark K Oct 15 '19 at 07:15
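
For reference, a minimal sketch of the whole script with that change applied. It assumes Python 3, where the built-in open accepts the same encoding and errors arguments as codecs.open. The function name combine_txt_files and the out_name parameter are hypothetical, and skipping the output file when collecting inputs is an added precaution (so a Combined.txt left over from an earlier run is not read back into itself), not something taken from the answer.

```python
import os


def combine_txt_files(base_folder, out_name="Combined.txt"):
    """Concatenate every .txt file in base_folder into out_name (a sketch)."""
    out_path = os.path.join(base_folder, out_name)

    # Collect the input files, leaving out the output file itself
    # (assumption: it may already exist from a previous run).
    inputs = sorted(
        os.path.join(base_folder, name)
        for name in os.listdir(base_folder)
        if name.endswith(".txt") and name != out_name
    )

    with open(out_path, "w", encoding="utf-8") as outfile:
        for fname in inputs:
            # errors="ignore" drops bytes that are not valid UTF-8.
            with open(fname, "r", encoding="utf-8", errors="ignore") as infile:
                for line in infile:
                    outfile.write(line)
    return out_path
```

Usage would be something like combine_txt_files("C:\\FDD\\") for the folder in the question.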