
The code below was used in Python 2 to combine all .txt files in a folder. It worked fine.

import os

base_folder = "C:\\FDD\\"

all_files = []

for each in os.listdir(base_folder):
    if each.endswith('.txt'):
        kk = os.path.join(base_folder, each)
        all_files.append(kk)

with open(base_folder + "Combined.txt", 'w') as outfile:
    for fname in all_files:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

When run in Python 3, it raises an error:

Traceback (most recent call last):
  File "C:\Scripts\thescript.py", line 26, in <module>
    for line in infile:
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'CP_UTF8' codec can't decode byte 0xe4 in position 53: No mapping for the Unicode character exists in the target code page.

I made this change:

with open(fname) as infile:

to

with open(fname, 'r', encoding = 'latin-1') as infile:

With that change, it raises a “MemoryError” instead.

How can I correct this error in Python 3? Thank you.

Mark K
  • There is a 2to3.py tool. Have you looked into that? Link: https://docs.python.org/2/library/2to3.html – tafaust Oct 15 '19 at 06:26
  • Always put the full error message (starting at the word "Traceback") in the question (not in a comment) as text (not as a screenshot). It contains other useful information. – furas Oct 15 '19 at 06:32
  • @ThomasHesse, thank you. I just tried it. It says "RefactoringTool: No changes to thescript.py" – Mark K Oct 15 '19 at 06:36
  • Try https://stackoverflow.com/a/12468274/2970853 – henrywongkk Oct 15 '19 at 06:37
  • Try reading only one file (or a smaller number of files) and see whether it still gives you a memory error. If that works, then it really is a memory error, and it can be solved quickly. – Massifox Oct 15 '19 at 06:46
  • @Massifox, the generated output file is 374 KB and already has contents before the error appears, so it seems to be writing. – Mark K Oct 15 '19 at 06:49
  • @henrywongkk, your way works! Can you post it as an answer so we can close this question? – Mark K Oct 15 '19 at 07:05
  • `CP_UTF8` is an alias for `cp65001`, a Microsoft implementation which differed slightly from standard UTF-8. In Python 3, try specifying `encoding='utf-8'` in your `open` calls. The implementations no longer differ, so in Python 3.8 `cp65001` is an alias for `UTF-8`. – snakecharmerb Oct 15 '19 at 07:07

1 Answer


As @transilvlad suggested here, use the open function from the codecs module to read the file:

import codecs

# fname and outfile come from the loop in the question
with codecs.open(fname, 'r', encoding='utf-8', errors='ignore') as infile:
    for line in infile:
        outfile.write(line)

With errors='ignore', any bytes that cannot be decoded as UTF-8 are dropped, and the returned string simply omits them.
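For completeness, here is a minimal sketch of the whole merge script with that fix applied. It uses the built-in open, which in Python 3 accepts the same encoding and errors arguments as codecs.open. The function name combine_txt_files and the out_name parameter are made up for illustration. Skipping the output file is an added precaution, not something from the thread: because Combined.txt lives in the same folder and ends in .txt, a second run of the original script would read the file it is writing to. That may or may not be what caused the MemoryError in the question.

```python
import os


def combine_txt_files(base_folder, out_name="Combined.txt"):
    """Concatenate every .txt file in base_folder into out_name.

    Returns the sorted list of input file names that were merged.
    (Hypothetical helper; the name and signature are illustrative.)
    """
    out_path = os.path.join(base_folder, out_name)
    # Exclude the output file itself so a re-run does not read the
    # file that is currently being written.
    names = sorted(
        n for n in os.listdir(base_folder)
        if n.endswith(".txt") and n != out_name
    )
    with open(out_path, "w", encoding="utf-8") as outfile:
        for name in names:
            in_path = os.path.join(base_folder, name)
            # errors="ignore" drops bytes that are not valid UTF-8.
            with open(in_path, "r", encoding="utf-8", errors="ignore") as infile:
                for line in infile:
                    outfile.write(line)
    return names
```

For the folder in the question, the call would be combine_txt_files("C:\\FDD\\").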

henrywongkk
  • Solution: change the "with open(fname) as infile:" line in the question to "with codecs.open(fname, 'r', encoding = 'utf-8', errors='ignore') as infile:" – Mark K Oct 15 '19 at 07:15
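One caveat with errors='ignore' is that the undecodable characters (such as the 0xe4 byte in the traceback) are silently lost from the combined file. If the goal is only to concatenate the files, a possible alternative is to copy raw bytes and skip decoding altogether. The sketch below assumes all input files share one encoding, so the result stays consistent. The function name combine_txt_files_binary is hypothetical, and this approach was not discussed in the thread.

```python
import os
import shutil


def combine_txt_files_binary(base_folder, out_name="Combined.txt"):
    """Concatenate .txt files byte for byte, without decoding them.

    Returns the sorted list of input file names that were merged.
    (Hypothetical helper; the name and signature are illustrative.)
    """
    out_path = os.path.join(base_folder, out_name)
    # Skip the output file so it is never used as its own input.
    names = sorted(
        n for n in os.listdir(base_folder)
        if n.endswith(".txt") and n != out_name
    )
    with open(out_path, "wb") as outfile:
        for name in names:
            with open(os.path.join(base_folder, name), "rb") as infile:
                # Copies in chunks, so large files are not loaded whole.
                shutil.copyfileobj(infile, outfile)
    return names
```

Because nothing is decoded, no UnicodeDecodeError can occur, and every byte of the inputs ends up in the output unchanged.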