I have several files with different encodings that I need to read from a program. How do I automatically read each one according to the encoding it was written in? I'm new to Python; I've tried searching but couldn't get anywhere. Please help.
-
You need to tell us what kind of format they are encoded in. – Stephen Lin Mar 06 '15 at 06:37
-
I wouldn't know that... some would be in UTF-8, some are in a Spanish format (I don't know what that is) – Anudeep Katragadda Mar 06 '15 at 06:40
-
It is fundamentally impossible to correctly *guess* a file encoding. If you do not know the encoding at all, the best you can do is try them all until you find one that works. Whether or not this is the correct encoding is a different question. You'd have to confirm that manually, or use complicated heuristic tests. – deceze Mar 06 '15 at 06:42
-
What happens if you just read them, like `open(file_name, 'r').read()`? – Stephen Lin Mar 06 '15 at 06:43
-
It throws a `UnicodeDecodeError`. – Anudeep Katragadda Mar 06 '15 at 06:45
-
Assuming some are in ISO-8859-1 and others in UTF-8, there are a few things you can try. Beware, they only *could* help. 1/ Search the file for byte sequences that are not valid UTF-8: if there is at least one, the file is not UTF-8 encoded. 2/ Identify some words that are likely to be found in every file and that are encoded differently in the two encodings (in French, I would look for `à`). 3/ Identify characters that are likely to be found and are encoded differently, and count how many you find (in Spanish, I would look for `ñ`). But you will only get *hints*. – Serge Ballesta Mar 06 '15 at 06:50
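The first check from the comment above can be sketched roughly like this, assuming the files really are either UTF-8 or ISO-8859-1 and nothing else. The function names `decode_utf8_or_latin1` and `read_text` are made up for illustration. Note that the latin-1 fallback can never fail, because every byte maps to some character in latin-1, so a "successful" decode only means the file was not valid UTF-8, not that latin-1 is the right answer.

```python
def decode_utf8_or_latin1(raw):
    """Decode bytes as strict UTF-8, falling back to latin-1.

    Returns a (text, encoding_used) tuple. The fallback is an assumption:
    it is only sensible if the non-UTF-8 files are known to be latin-1.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps all 256 byte values, so this never raises.
        return raw.decode("latin-1"), "latin-1"


def read_text(path):
    """Read a file in binary mode and decode it with the heuristic above."""
    with open(path, "rb") as f:
        return decode_utf8_or_latin1(f.read())
```

A file containing only ASCII decodes identically either way, so it will simply be reported as UTF-8.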
1 Answer
0
It is not possible to know for certain, given an encoded text file, what encoding was used; the best you can do is guess -- no certainty.
For guessing purposes, you probably want to download and install https://pypi.python.org/pypi/chardet . Its guesses are well-informed.
But, they are still guesses! And sometimes they'll be wrong. Guessing among the various iso-8859-x encodings for various values of x is particularly hard-to-impossible.
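A minimal sketch of how chardet might be used for this, not a definitive recipe. `chardet.detect` takes a bytes object and returns a dict that includes `encoding` and `confidence` keys. The wrapper name `guess_encoding`, the `detect` parameter (there only so the detector can be swapped out) and the 0.5 threshold are assumptions made for illustration.

```python
def guess_encoding(raw, detect=None, min_confidence=0.5):
    """Return the guessed encoding name for raw bytes, or None if unsure.

    By default this uses chardet.detect (third-party: pip install chardet).
    The min_confidence cut-off is an arbitrary illustrative value.
    """
    if detect is None:
        import chardet  # imported lazily so the module loads without it
        detect = chardet.detect
    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # chardet reports encoding=None when it cannot make any guess at all.
    if encoding is None or confidence < min_confidence:
        return None
    return encoding
```

You would read the file with `open(path, "rb")`, pass the bytes to `guess_encoding`, and only then decode. When it returns `None`, that is the point to check manually rather than trust a low-confidence guess.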
In the future, may this teach you to arrange less-impossible conditions!-)

Alex Martelli
-
Thank you, Alex! So, I have a Spanish dataset encoded in latin-1. Can latin-1 also handle English datasets? – Anudeep Katragadda Mar 06 '15 at 06:51
-
@AnudeepKatragadda, yes, all the ISO-8859-* family of codecs (and latin-1 is also known as iso-8859-1) embed ASCII as a subset. (If you need to deal with pound signs and/or euro signs I believe ISO-8859-15 may be better, but it's a long time since I used anything but universal utf-8 coding, so you may want to double check on that:-). – Alex Martelli Mar 07 '15 at 02:10
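To double check the points in that last comment, here is a small sketch using only Python's built-in codecs. The helper `can_encode` is a made-up name. It suggests that plain ASCII text produces the same bytes under latin-1, that latin-1 already covers the pound sign, and that the euro sign is the one that needs ISO-8859-15.

```python
def can_encode(char, codec):
    """Return True if the given codec has a byte for this character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# ASCII is a subset of latin-1, so English text encodes to identical bytes.
ascii_text = "plain English text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("latin-1")

# Which codec covers which currency sign.
pound_in_latin1 = can_encode("£", "latin-1")
euro_in_latin1 = can_encode("€", "latin-1")
euro_in_latin9 = can_encode("€", "iso-8859-15")
```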