
I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding that comes declared with the data (if any).

For a while I have used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I ran into some problems lately when I noticed that it doesn't support Scandinavian encodings (for example ISO-8859-1). Apart from that, it takes a huge amount of time/CPU/memory to get results: roughly 40 s for a 2 MB text file.
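This is roughly how I have been using it (a minimal sketch; the helper name `to_utf8` is just for illustration):

```python
import chardet  # http://pypi.python.org/pypi/chardet

def to_utf8(raw_bytes):
    # chardet.detect() returns something like
    # {'encoding': 'ISO-8859-2', 'confidence': 0.87}
    guess = chardet.detect(raw_bytes)
    encoding = guess['encoding'] or 'utf-8'
    return raw_bytes.decode(encoding, errors='replace').encode('utf-8')

with open('name.txt', 'rb') as f:
    utf8_bytes = to_utf8(f.read())
```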

I tried just using the standard Linux `file` command:

file -bi name.txt

With all my files so far it has given me a 100% correct result, and it does so in about 0.1 s for a 2 MB file. It supports Scandinavian character encodings as well.
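For reference, this is roughly how I call it from Python (just a sketch; it assumes a GNU `file` that understands `-bi` and prints a MIME string such as "text/plain; charset=iso-8859-1"):

```python
import subprocess

def charset_from_file_cmd(path):
    # `file -bi` prints e.g. "text/plain; charset=iso-8859-1"
    out = subprocess.check_output(['file', '-bi', path]).decode('ascii').strip()
    for part in out.split(';'):
        part = part.strip()
        if part.startswith('charset='):
            return part[len('charset='):]
    return None

charset = charset_from_file_cmd('name.txt')
with open('name.txt', 'rb') as f:
    utf8_bytes = f.read().decode(charset or 'utf-8', errors='replace').encode('utf-8')
```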

So I guess the advantages of using `file` are clear. What are the downsides? Am I missing something?

Niklas9
  • If it's 100% accurate, then I'm wondering why someone hasn't implemented it (or `chardet`) using the same rules that `file` uses... - have you tried a `file` vs `chardet` comparison across a significant amount of test data? – Jon Clements Nov 27 '12 at 20:01
  • FWIW, ISO-8859-1 (and its revision, -15) is not just Scandinavian, it's used for many other Latin-based scripts. If the input is "mostly ASCII" and not UTF-8, ISO-8859-1 is a pretty good guess. http://en.wikipedia.org/wiki/ISO/IEC_8859#The_Parts_of_ISO.2FIEC_8859 – Thomas Nov 27 '12 at 20:04
  • Jon, I totally agree. Hence my question. I don't have access to enough data that would make this approach statistically significant, so the answer to your question is no, unfortunately. – Niklas9 Nov 27 '12 at 21:22
  • Thomas, yes, sorry, you're completely right. The issue I ran into involved Scandinavian languages; I guess that's why I used it as an example. I agree that it would probably be a good guess, but if there's a fast method that's more accurate, I would prefer to use it. – Niklas9 Nov 27 '12 at 21:24

2 Answers


Old MS-DOS and Windows formatted files can be detected as unknown-8bit instead of ISO-8859-X, due to not completely standard encodings. chardet, on the other hand, will make an educated guess and report a confidence value.

http://www.faqs.org/faqs/internationalization/iso-8859-1-charset/

If you won't be handling old, exotic, out-of-standard text files, I think you can use `file -i` without many problems.
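If you do have to cope with such files, one option is to ask `file` first and only fall back to chardet's guess when it reports unknown-8bit or binary. An untested sketch of that idea (the function name is just for illustration):

```python
import subprocess
import chardet

def guess_charset(path):
    # Ask `file` first: it is fast and usually right for well-formed text.
    out = subprocess.check_output(['file', '-bi', path]).decode('ascii').strip()
    charset = out.rsplit('charset=', 1)[-1] if 'charset=' in out else ''
    if charset and charset not in ('unknown-8bit', 'binary'):
        return charset
    # Fall back to chardet's statistical guess for the awkward files.
    with open(path, 'rb') as f:
        return chardet.detect(f.read())['encoding']
```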

GendoIkari
  • Thanks for your answer, it makes sense. Do you have an example of such a file? An old MS-DOS or Windows formatted one, I mean. – Niklas9 Nov 30 '12 at 09:34
  • This can be an example, I think. It's an old text file from an MS-DOS application, 1988. `file -i` on my Ubuntu 12.04 detects it as application/octet-stream; charset=binary. There's a wrong character somewhere. I'm not the master encoder, but if you open it with Okteta you can see binary data (0x09 bytes) everywhere. If there's another explanation let me know, thank you. http://filebin.ca/OOQ4WVHhaKT – GendoIkari Nov 30 '12 at 12:08

I have found "chared" (http://code.google.com/p/chared/) to be pretty accurate. You can even train new encoding detectors for languages that not supported.

It might be a good alternative when chardet starts acting up.

Paulo Malvar
  • Cool, thanks. It seems to have one extra requirement, though: you have to know the language used in the text. Usually I don't know that in my app, but it definitely seems to be a good alternative. – Niklas9 Feb 22 '13 at 09:02
  • Yes, you need to know the language, but you could guess it using, for example, langid (https://github.com/saffsd/langid.py); see the sketch below. – Paulo Malvar Feb 22 '13 at 16:54
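A minimal sketch of that language-guessing step with langid.py (assuming `pip install langid`; feeding the result into chared is left out, since its API isn't covered here):

```python
import langid  # https://github.com/saffsd/langid.py

with open('name.txt', 'rb') as f:
    # latin-1 never fails to decode, which is good enough for language ID.
    sample = f.read(100000).decode('latin-1')

lang, score = langid.classify(sample)
print(lang, score)  # e.g. 'sv' with a log-probability style score
```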