6

I have a web page that accepts CSV uploads. These files may come from a variety of sources. As far as I know, a CSV file has no way to declare its own encoding, so I cannot reliably treat all of them as UTF-8 or any other single encoding.

Is there a way to intelligently guess the encoding of the CSV I am getting? I am working with Python, but I am open to language-agnostic methods too.

dda
shabda
  • There are ways, as long as you can live with mis-detections, because there's no 100% sure-fire way to guess the encoding. – Joachim Sauer May 27 '13 at 10:55
  • possible duplicate of [Is there a Python library function which attempts to guess the character-encoding of some bytes?](http://stackoverflow.com/questions/269060/is-there-a-python-library-function-which-attempts-to-guess-the-character-encodin) – Joachim Sauer May 27 '13 at 10:56
  • You can detect the encoding pretty reliably if you know the language these files are in - do you? – georg May 27 '13 at 11:02
  • They will be in English most of the time, but I can't be sure. This should accept any CSV. – shabda May 27 '13 at 11:42
  • @shabda If you are language-agnostic, then MAYBE this counts for the encoding as well. In this case - and if you just write the data into another file or so - you can assume `latin1` as this takes all data "as they are" (bytes -> unicode) and write them out again (or, in Py2, stay in `str` instead of `unicode`). – glglgl May 27 '13 at 13:17
  • I have yet to see a CSV file which is not in UTF-8. I suggest not supporting encodings other than UTF-8. – Pavel Radzivilovsky May 28 '13 at 19:26
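
The `latin1` pass-through mentioned in the comments above can be sketched roughly as follows. This is only an illustration, not code from any commenter: the function name `rewrite_csv` is hypothetical, and it assumes the input uses an ASCII-compatible encoding (so commas, quotes and newlines are plain ASCII bytes). It would not be suitable for UTF-16 or UTF-32 input.

```python
import csv
import io


def rewrite_csv(raw: bytes) -> bytes:
    """Parse and re-serialise CSV bytes without knowing their real encoding.

    Hypothetical helper: latin-1 maps every byte 0x00-0xFF to exactly one
    code point, so decoding never fails and encoding back restores the
    original bytes of every field.
    """
    text = raw.decode("latin-1")
    rows = list(csv.reader(io.StringIO(text, newline="")))

    # ... rows could be inspected or reordered here; non-ASCII fields are
    # opaque "mojibake" at this point and should not be shown to users.

    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue().encode("latin-1")
```

The field contents survive byte for byte, but the text is only meaningful to a consumer that knows the real encoding, so this fits the "just write the data into another file" case and nothing more.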

1 Answer

8

There is no guaranteed-correct way to determine the encoding of a file by looking only at the file itself, but you can use a heuristics-based solution such as chardet.
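
A minimal sketch of such a heuristic in Python might look like the following. The function name `guess_encoding` and the order of the checks are assumptions made for illustration, not part of chardet. It checks for a byte-order mark, then tries strict UTF-8, and only then asks chardet (a third-party package, used here only if it is installed), falling back to `latin-1` otherwise. Any result is a guess and can be wrong.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Return a best-guess codec name for `data` (hypothetical helper)."""
    # 1. A byte-order mark is the strongest hint. UTF-32 must be tested
    #    before UTF-16 because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # 2. Bytes that decode as strict UTF-8 are very likely UTF-8
    #    (pure ASCII also lands here).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Otherwise use chardet's statistical guess when it is available.
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return fallback
    guess = chardet.detect(data)["encoding"]  # may be None
    return guess or fallback
```

For an upload handler, one would typically call `guess_encoding(uploaded_bytes)` and then `uploaded_bytes.decode(result, errors="replace")`, so that a wrong guess degrades the text rather than raising an exception.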

asciimoo