
I have a C++ program which reads text files. Currently I'm using C's fopen() to open the file and fgetc() to read it character by character. I typedef'd a "file character", which is actually an int (and I can change it to long without problems, obviously).

Now the program can read UTF-7 and UTF-8 text files, but what if I use UTF-16 or UTF-32 text files? Is there a way to infer the file encoding and then read the file properly? Even switching to C++'s istreams wouldn't be a problem.

too honest for this site
NoImaginationGuy
  • Pray that the file has a [Byte Order Mark](https://en.wikipedia.org/wiki/Byte_order_mark) you can read and then set the stream's locale accordingly. If not, start guessing. – user4581301 Jun 15 '16 at 19:24
  • As stated in the answer for this similar question, reading the file in binary mode will bypass any limitations created by incompatibility with ASCII. – Havenard Jun 15 '16 at 19:39

2 Answers


There's no way to figure it out reliably for an arbitrary byte stream. You could open a binary executable file the same way, and it isn't encoded in any of the mentioned encodings.

What you can do is try to guess. Treat the file as binary and read the first 10k bytes or so, then compare the distribution of byte values to some canonical models you've built, see which one is closest, and go with that one.

To build such a model, take some texts (either material you already have, or some articles copied from Wikipedia), encode them with the various encodings, and run the same algorithm to build the distributions. Average the results and use them as the canonical models for comparison. This works best when you tend to have the same kind of text (i.e. if you build the models from plain English text, it may be difficult to classify documents that use non-ASCII characters).
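The comparison step above can be sketched as follows. This is a minimal illustration, not a tested detector: the function names (`byteHistogram`, `histogramDistance`) are made up for this example, and a real classifier would use better-trained models and a more robust distance measure.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Build a normalized 256-bin histogram of byte values from a buffer
// (e.g. the first 10k bytes of the file, read in binary mode).
std::array<double, 256> byteHistogram(const std::vector<unsigned char>& buf) {
    std::array<double, 256> h{};
    for (unsigned char b : buf) h[b] += 1.0;
    if (!buf.empty())
        for (double& v : h) v /= static_cast<double>(buf.size());
    return h;
}

// Euclidean distance between two histograms; the canonical model
// with the smallest distance is the best guess for the encoding.
double histogramDistance(const std::array<double, 256>& a,
                         const std::array<double, 256>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < 256; ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return std::sqrt(d);
}
```

You would compute `byteHistogram` once per canonical model (offline, from your training texts) and once for the file being classified, then pick the model with the minimum `histogramDistance`.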

If your files have a byte order mark, it helps a lot.

Sorin
  • OP asks about UTF-x encodings (so far :) ). '\0' bytes, which are practically there 100% of the time for x >= 16 (at the first ASCII character), let us distinguish amongst these - OTOH your answer is very interesting for the generic case of many different encodings. – lorro Jun 15 '16 at 19:33
  • @lorro you are assuming English text. If the text is mostly ☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ and ─ ━ │ ┃ ┄ ┅ ┆ ┇ ┈ ┉ ┊ ┋ ┌ ┍ ┎ ┏ then you're not going to see so many '\0' bytes and it might throw the detection. It will also not work with what I've suggested if your training set is only English text, but it has a chance of working if the training set contains documents like that. – Sorin Jun 16 '16 at 09:13

While you cannot infer it definitively, in practice you might try-and-fail based on a list of encodings.

  • UTF-16 will likely have a '\0' very early (whether at even or odd position(s) is decided by endianness, which might be little, big, or on some architectures, medium);
  • UTF-32 will likely have three of those; while
  • UTF-8 strings should not have this character.
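The '\0'-density heuristic above can be sketched like this. It is a rough illustration under the assumption (noted in the comments below) that the text is mostly ASCII-range; the function name and the ratio thresholds are made up for this example:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rough guess based on NUL-byte density in the first chunk of a file.
// For ASCII-range text: UTF-16 puts a NUL in every other byte,
// UTF-32 puts three NULs per character, and valid UTF-8 has none.
// The 0.6 and 0.3 thresholds are arbitrary cutoffs for this sketch.
std::string guessUtfByNulBytes(const std::vector<unsigned char>& head) {
    if (head.empty()) return "unknown";
    std::size_t nuls = 0;
    for (unsigned char b : head)
        if (b == 0) ++nuls;
    double ratio = static_cast<double>(nuls) / head.size();
    if (ratio > 0.6) return "UTF-32"; // ~3 of 4 bytes are NUL
    if (ratio > 0.3) return "UTF-16"; // ~1 of 2 bytes is NUL
    return "UTF-8";                   // NUL bytes are not expected
}
```

As Sorin's comment points out, this falls apart for text dominated by high code points, where the NUL pattern is much weaker.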

Additionally, UTF files are permitted (but not required) to store a [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark). If you have one, you are lucky, as it differs amongst the encodings.
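A BOM check might look like the sketch below. The BOM byte sequences are the standard ones from the Unicode specification; the function name is made up for this example. One subtlety: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so the longer pattern must be tested first.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Inspect the first bytes of a file for a byte order mark.
// Returns the detected encoding name, or "none" if no BOM is found.
std::string detectBom(const std::vector<unsigned char>& head) {
    auto startsWith = [&](std::initializer_list<unsigned char> bom) {
        if (head.size() < bom.size()) return false;
        std::size_t i = 0;
        for (unsigned char b : bom)
            if (head[i++] != b) return false;
        return true;
    };
    // UTF-32 LE must be checked before UTF-16 LE (shared FF FE prefix).
    if (startsWith({0xFF, 0xFE, 0x00, 0x00})) return "UTF-32LE";
    if (startsWith({0x00, 0x00, 0xFE, 0xFF})) return "UTF-32BE";
    if (startsWith({0xEF, 0xBB, 0xBF}))       return "UTF-8";
    if (startsWith({0xFF, 0xFE}))             return "UTF-16LE";
    if (startsWith({0xFE, 0xFF}))             return "UTF-16BE";
    return "none";
}
```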

lorro
  • Not true for non-English text. If you need a lot of high-order Unicode characters you will not see those patterns so clearly. – Sorin Jun 15 '16 at 19:31
  • My question was actually because I'll have to read text like Japanese ideograms, so it is not that easy – NoImaginationGuy Jun 15 '16 at 19:35