I'm decoding a large (about a gigabyte) flat file database which mixes character encodings willy-nilly. The Python module chardet
is doing a good job, so far, of identifying the encodings, but I've hit a stumbling block...
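For context, this is roughly how I'm applying chardet, record by record (a simplified sketch; the function name is mine):

import chardet

def guess_and_decode(line):
    # Ask chardet for its best guess, then decode with that encoding
    guess = chardet.detect(line)  # e.g. {'encoding': 'Big5', 'confidence': 0.99}
    return unicode(line, guess['encoding'])

Here's the record it chokes on: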
In [428]: badish[-3]
Out[428]: '\t\t\t"Kuzey r\xfczgari" (2007) {(#1.2)} [Kaz\xc4\xb1m]\n'
In [429]: chardet.detect(badish[-3])
Out[429]: {'confidence': 0.98999999999999999, 'encoding': 'Big5'}
In [430]: unicode(badish[-3], 'Big5')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~/src/imdb/<ipython console> in <module>()
UnicodeDecodeError: 'big5' codec can't decode bytes in position 11-12: illegal multibyte sequence
chardet reports very high confidence in its choice of encoding, but the line doesn't decode with it... Are there any other sensible approaches?
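For what it's worth, the only fallback I can think of is to walk a short list of likely encodings until one decodes without raising; the candidate list below is just my guess based on the Turkish-looking data, with a single-byte encoding (which can't fail) as a last resort:

def decode_with_fallback(line, candidates=('utf-8', 'cp1252')):
    # Try the stricter candidates first; fall back to a single-byte
    # Turkish encoding, which never raises, if none of them fit
    for enc in candidates:
        try:
            return unicode(line, enc)
        except UnicodeDecodeError:
            continue
    return unicode(line, 'iso-8859-9')

But that feels like giving up on detection entirely.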