
I am a Python beginner. I am trying to concatenate the text from all 8 text files into one text file to build a corpus. However, I am getting the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>

 import glob2

 filenames = glob2.glob('Final_Corpus_SOAs/*.txt')  # list of all .txt files in the directory
 print(filenames)

output: ['Final_Corpus_SOAs\\1.txt', 'Final_Corpus_SOAs\\2.txt', 'Final_Corpus_SOAs\\2018 SOA Muir.txt', 'Final_Corpus_SOAs\\3.txt', 'Final_Corpus_SOAs\\4.txt', 'Final_Corpus_SOAs\\5.txt', 'Final_Corpus_SOAs\\6.txt', 'Final_Corpus_SOAs\\7.txt', 'Final_Corpus_SOAs\\8.txt']

with open('output.txt', 'w', encoding="utf-8") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

Output: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>

Thanks for the help.

user8720570

2 Answers


You should specify the encoding when you open the file. This was already answered in an earlier question, which has more information.

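Add encoding="utf8" to your code, including the open() call that reads each input file, as below. The error is a decode error, so it is raised while reading:

with open('output.txt', 'w', encoding="utf8") as outfile:
    for fname in filenames:
        with open(fname, encoding="utf8") as infile:
            for line in infile:
                outfile.write(line)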
Sabesh
  • Thanks for the answer, @Sabesh. I have tried encoding="utf-8" as well as errors='ignore'. It still shows the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7311: character maps to <undefined>". – user8720570 Jan 14 '19 at 21:47

If you are sure of the encoding, you should declare it when you open the files, both for reading and writing:

encoding = 'utf8'    # or 'latin1' or 'cp1252' or...

with open('output.txt', 'w', encoding=encoding) as outfile:
    for fname in filenames:
        with open(fname, encoding=encoding) as infile:
            for line in infile:
                outfile.write(line)
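
For context on why the read side matters: on Windows, open() without an encoding argument typically falls back to the locale code page (often cp1252, reported as the 'charmap' codec), and byte 0x9d has no mapping there. In UTF-8 the same byte is a valid continuation byte. The snippet below is a minimal sketch of that difference; the right double quotation mark is only an assumed example of what the files might contain.

```python
# Assumed example input: the UTF-8 bytes of a right double quotation mark (U+201D).
raw = b"\xe2\x80\x9d"

# Decoding as UTF-8 succeeds and yields a single character.
as_utf8 = raw.decode("utf-8")

# Decoding as cp1252 fails, because 0x9d is undefined in that code page.
try:
    raw.decode("cp1252")
    cp1252_failed = False
    message = ""
except UnicodeDecodeError as exc:
    cp1252_failed = True
    message = str(exc)

print(repr(as_utf8), cp1252_failed, message)
```

If the files really are UTF-8, passing encoding='utf8' to the open() call that reads them should avoid the error.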

If you are unsure or do not want to be bothered by encoding, you can copy the files at the byte level by reading and writing them as binary:

with open('output.txt', 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'rb') as infile:
            for line in infile:
                outfile.write(line)
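
Note that a byte-level copy keeps whatever encoding each file had, so the result can mix encodings. As an alternative that is not part of the answer above, one could try a few candidate encodings per file and always write UTF-8. This is only a sketch: the helper names read_text_with_fallback and build_corpus and the list of candidate encodings are assumptions for illustration.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: decode a file with the first encoding that works."""
    with open(path, "rb") as infile:
        raw = infile.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode("latin-1")


def build_corpus(filenames, output_path):
    """Hypothetical helper: concatenate the decoded files into one UTF-8 file."""
    # newline="" writes line endings exactly as they appear in the source files.
    with open(output_path, "w", encoding="utf-8", newline="") as outfile:
        for fname in filenames:
            outfile.write(read_text_with_fallback(fname))
```

It could be called as build_corpus(filenames, 'output.txt'). A successful decode does not prove the guess was right, so spot-check the output when the files come from different sources.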
Serge Ballesta