Is it possible to "sniff" the Character encoding?

Question

I have a webpage that accepts CSV files. These files may be created in a variety of places. As far as I know, there is no way to specify the encoding inside a CSV file, so I cannot reliably treat all of them as UTF-8 or any other single encoding.

Is there a way to intelligently guess the encoding of the CSV I am getting? I am working with Python, but I am willing to use language-agnostic methods too.

There are ways, as long as you can live with mis-detections, because there's no 100% sure-fire way to guess the encoding. — Joachim Sauer, May 27 '13 at 10:55
possible duplicate of [Is there a Python library function which attempts to guess the character-encoding of some bytes?](http://stackoverflow.com/questions/269060/is-there-a-python-library-function-which-attempts-to-guess-the-character-encodin) — Joachim Sauer, May 27 '13 at 10:56
You can detect the encoding pretty reliably if you know the language these files are in - do you? — georg, May 27 '13 at 11:02
They will be in English most of the time, but I can't be sure. This should accept any CSV. — shabda, May 27 '13 at 11:42
@shabda If you are language-agnostic, then maybe you can be encoding-agnostic as well. In that case, and if you just write the data out to another file or similar, you can assume `latin1`, since it takes all data as-is (bytes -> unicode) and writes it out again unchanged (or, in Py2, you can stay in `str` instead of `unicode`). — glglgl, May 27 '13 at 13:17
I have yet to see a CSV file which is not in UTF-8. I suggest not supporting encodings other than UTF-8. — Pavel Radzivilovsky, May 28 '13 at 19:26
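
The `latin1` pass-through suggested in the comments above can be sketched as follows. This is only an illustration; the sample bytes are made up. It relies on the fact that Latin-1 maps every byte value 0x00-0xFF to the code point with the same number, so decoding never fails and re-encoding restores the original bytes. The decoded text may still display the wrong characters if the file was not really Latin-1.

```python
# Hypothetical input: one field is UTF-8 ("José"), another is Latin-1 ("Köln").
raw = b"name,city\r\nJos\xc3\xa9,K\xf6ln\r\n"

# Latin-1 decoding accepts any byte sequence, one character per byte.
text = raw.decode("latin-1")

# Encoding back with the same codec reproduces the input byte for byte.
round_tripped = text.encode("latin-1")
```

This only helps when the data is passed through untouched; it does not tell you what the characters actually are.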

score 8 · Accepted Answer · answered May 27 '13 at 11:13

There is no fully reliable way to determine the encoding of a file by looking only at the file itself, but you can use a heuristics-based solution, e.g. chardet.
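
A minimal sketch of such a heuristic, assuming the upload is available as `bytes`. The helper name `sniff_encoding` and its `fallback` parameter are hypothetical, not part of any library. It checks for a byte order mark, then tries strict UTF-8, and only then asks `chardet.detect` (if the third-party `chardet` package is installed). The result is a guess and can be wrong.

```python
import codecs


def sniff_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Return a best-guess encoding name for data (hypothetical helper)."""
    # 1. A byte order mark is the strongest hint. UTF-32 is checked first
    #    because its little-endian BOM begins with the UTF-16 LE BOM.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. Bytes that decode as strict UTF-8 are very likely UTF-8
    #    (pure ASCII also lands here).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Otherwise use chardet's statistical guess when it is available.
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return fallback
    guess = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    return guess["encoding"] or fallback
```

A caller might then decode with `data.decode(sniff_encoding(data), errors="replace")` so that a wrong guess degrades to replacement characters instead of raising an exception.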

asciimoo
