I am using chardet to detect the encoding of text files, including Italian ones. The problem is that it consistently detects their encoding as iso-8859-2, while the correct result would be iso-8859-1. Does anybody know a fix? My locale is set to Polish. Could that influence the detection?
Viewed 498 times
0
-
Since iso-8859-2 is for Eastern European languages, I would say that yes, that probably influences the detection. Which method do you use to detect the encoding? – Junuxx Oct 10 '12 at 15:36
-
Junuxx - I am using the 'detect' function, e.g. chardet.detect(text) – twowo Oct 10 '12 at 15:49
-
I recommend reading the accepted answer in this [question](http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file). – Pedro Romano Oct 10 '12 at 17:23
1 Answer
1
chardet doesn't support iso-8859-1, which is why it's not detecting it. For the supported character encodings, see chardet's homepage: http://pypi.python.org/pypi/chardet.
I use the Linux program 'file' to get the character encoding of different content. I'm not sure how safe it is (see my question "Encoding detection in Python, use the chardet library or not?"), but it has worked with great results for me so far.
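For reference, here is a minimal sketch of calling 'file' from Python. It assumes a 'file' binary on the PATH that accepts the --brief and --mime-encoding options (as the common Linux implementation does). The helper names detect_with_file and parse_file_output are my own, not part of any library, and the result should be treated as a hint rather than a guarantee.

```python
import subprocess


def parse_file_output(raw: bytes) -> str:
    """Turn the raw stdout of `file --brief --mime-encoding` into a clean name."""
    # The tool prints a single ASCII token followed by a newline, e.g. b"iso-8859-1\n".
    return raw.decode("ascii", errors="replace").strip()


def detect_with_file(path: str) -> str:
    """Ask the external `file` program for its guess at the encoding of `path`.

    Assumption: `file` is installed and supports --brief / --mime-encoding.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        ["file", "--brief", "--mime-encoding", path],
        capture_output=True,
        check=True,
    )
    return parse_file_output(result.stdout)
```

Usage would look like detect_with_file("some_italian_text.txt"), where the file name is only a placeholder.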
Btw, your local language should not influence the detection.
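What the detector sees is only the bytes. A small sketch, using only the standard library, of why those bytes alone can be ambiguous: the same byte string decodes without error as both iso-8859-1 and iso-8859-2, so a detector has to guess from statistics. The sample string is my own illustration, not taken from the question's files.

```python
# Hypothetical sample: a few Italian words with accented vowels.
sample = "città è più"

# Encode the way the question's files are presumably stored.
raw = sample.encode("iso-8859-1")

# Both decodes succeed: every byte is a valid character in either charset.
as_latin1 = raw.decode("iso-8859-1")
as_latin2 = raw.decode("iso-8859-2")

# Only the iso-8859-1 decode gives back the original text; the iso-8859-2
# decode maps the same bytes to Central European letters instead.
print(as_latin1 == sample)
print(as_latin2 == sample)
print(as_latin2)
```

Since neither decode raises an error, checking the decoded text against the expected language (or simply specifying the encoding when it is known) is more dependable than relying on detection alone.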