Problems with the encoding of several concatenated files in Python?

Question

I have a folder with lots of .txt files in Spanish, and I decided to merge them into one .txt file as follows:

import os
import shutil

def concatFiles():
    path = '/Users/user/Desktop/OpinionsTAG_txt/'
    files = os.listdir(path)
    with open("/Users/user/Desktop/concat_file.txt", "wb") as fo:
        for f in files:
            with open(os.path.join(path, f), "rb") as fi:
                shutil.copyfileobj(fi, fo)


if __name__ == "__main__":
    concatFiles()

The problem is that the output (i.e. concat_file) doesn't preserve the Spanish characters: for example, concat_file contains direcci√≥n instead of dirección. Also, I'm working on OS X: when I open concat_file with Sublime Text it looks like this: 0000 0001 2000 0000 0000 0001 4000 0000, but when I open it with TextEdit it looks the way I wanted. Why is this happening and how can I solve it?

score 1 · Accepted Answer · edited May 23 '17 at 12:12


Try using codecs, as suggested here: https://stackoverflow.com/a/19591815/4339369. This will allow you to read and write the files as UTF-8, which may solve your problem.

import codecs
import os
import shutil


def concatFiles():
    path = '/Users/user/Desktop/OpinionsTAG_txt/'
    files = os.listdir(path)
    # codecs.open decodes on read and encodes on write using the given encoding.
    with codecs.open("/Users/user/Desktop/concat_file.txt", "wb", encoding='utf8') as fo:
        for f in files:
            with codecs.open(os.path.join(path, f), "rb", encoding='utf8') as fi:
                shutil.copyfileobj(fi, fo)


if __name__ == "__main__":
    concatFiles()

A good overview of Unicode issues in Python 2.x can be found here: http://nedbatchelder.com/text/unipain.html.
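If some files in the folder turn out not to be UTF-8, or the folder contains entries that are not text files, one option is to decode each file individually and re-encode everything as UTF-8. The following is a minimal Python 3 sketch, not the method from the answer above: the function name `concat_text_files`, the `.txt` filter, and the fallback to cp1252 are assumptions made for illustration.

```python
import os


def concat_text_files(src_dir, dest_path, encodings=("utf-8", "cp1252")):
    """Concatenate the .txt files in src_dir into one UTF-8 file.

    Hypothetical helper: the candidate encodings are assumptions and
    should be adjusted to match the real input files.
    """
    # Only pick up .txt files, so hidden or binary entries are skipped.
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".txt"))
    # newline="" writes the text back without translating line endings.
    with open(dest_path, "w", encoding="utf-8", newline="") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                raw = src.read()
            # Try each candidate encoding in order; keep the first that works.
            for enc in encodings:
                try:
                    text = raw.decode(enc)
                    break
                except UnicodeDecodeError:
                    continue
            else:
                raise ValueError("could not decode %s" % name)
            out.write(text)
```

The order of `encodings` matters: UTF-8 is tried first because it rejects most byte sequences that are not UTF-8, whereas single-byte codecs accept almost anything.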

Edit for anyone in the future: on OS X, you can usually determine a file's encoding by

file -I <filename>

(from "How do I determine file encoding in OSX?").
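A similar check can be done from Python. The sketch below reports which files fail strict UTF-8 decoding and at what byte offset, which can help find the one file that triggers a UnicodeDecodeError. The function name `find_non_utf8` is hypothetical, and passing this check only means the bytes are valid UTF-8, not that UTF-8 was the intended encoding.

```python
def find_non_utf8(paths):
    """Return (path, byte_offset) for each file that is not valid UTF-8."""
    bad = []
    for path in paths:
        with open(path, "rb") as fh:
            raw = fh.read()
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte.
            bad.append((path, exc.start))
    return bad
```

It could be called with the full paths of the files in the source folder, for example those built with os.path.join from the os.listdir results.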


answered Dec 09 '14 at 21:03

greavg


Thanks, this is the output: `newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 10: invalid continuation byte` – john doe Dec 09 '14 at 21:11
Does this mean that the files are not in utf8? – john doe Dec 09 '14 at 21:12

It's a bit of a guess-and-check if you don't already know how the files are encoded. – greavg Dec 09 '14 at 21:16
I tried utf8, utf16, utf32 and ascii. I still get the same exception. What else can I do? – john doe Dec 09 '14 at 21:50
See the terminal command that I edited in at the end of my post. It should be able to tell you how the Spanish files are encoded. – greavg Dec 09 '14 at 21:51
I still have the same problem. When I check the .txt files in the terminal, it says the Spanish texts are encoded as `utf-8`. This is the traceback: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 10: invalid continuation byte – john doe Dec 09 '14 at 22:12
That worked, but now instead of direcci√≥n I have direcciÃ³n. The correct word is dirección. What other encoding should I try? – john doe Dec 09 '14 at 22:26

cp1252 or mac_roman or utf-7 are possibilities I guess. I'm running out of ideas. – greavg Dec 09 '14 at 22:32
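For reference, both garbled spellings quoted in this thread can be reproduced by decoding the UTF-8 bytes of dirección with the wrong codec. This short Python 3 sketch suggests, without proving it, that the text itself is UTF-8 and is being interpreted as mac_roman or cp1252 somewhere along the way. The variable names are illustrative only.

```python
original = "dirección"
utf8_bytes = original.encode("utf-8")  # the ó becomes the two bytes C3 B3

# Decoding those bytes with a single-byte codec turns ó into two characters.
as_mac_roman = utf8_bytes.decode("mac_roman")
as_cp1252 = utf8_bytes.decode("cp1252")

print(as_mac_roman)
print(as_cp1252)

# Decoding with the matching codec restores the original text.
print(utf8_bytes.decode("utf-8"))
```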
