
I'm trying to open an existing file and read it, e.g.:

std::string text = fileOpenRead(readonly, filePath);

Then I want to change the string's encoding to UTF-8 and save it.

So, I need two APIs:

  1. Find a file's existing encoding.

  2. Convert the data from the above encoding to UTF-8.

I searched Google and StackOverflow, but I can't find a perfect solution.

Can anyone please share some hints with me?

Remy Lebeau
  • It is not possible to reliably guess the encoding of a file. See [How can I detect the encoding/codepage of a text file](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file). – dxiv Jul 07 '16 at 01:06

2 Answers


Step #1 is very difficult to accomplish if the file is not already using a UTF encoding, like UTF-8 or UTF-16 (UTF-8 is very easy to detect, and UTF-16 is also fairly easy to detect to some extent, even if a BOM is not present).
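To illustrate the "easy" cases, here is a minimal sketch that checks for a BOM and otherwise tests whether the bytes are well-formed UTF-8. The names `Guess`, `isValidUtf8`, and `guessEncoding` are hypothetical, made up for this example. It does not attempt BOM-less UTF-16, UTF-32, or any legacy code page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Guess { Utf8Bom, Utf16LeBom, Utf16BeBom, ValidUtf8, Unknown };

// Returns true if `s` is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, surrogates, and code points
// above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t len = 0;     // total bytes in this sequence
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;           // continuation byte or invalid lead byte

        if (i + len > n) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Checks for a BOM first, then falls back to UTF-8 validation.
// Note: FF FE 00 00 (the UTF-32LE BOM) is reported as UTF-16LE here.
Guess guessEncoding(const std::string& data) {
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(data[i]); };
    if (data.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return Guess::Utf8Bom;
    if (data.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE) return Guess::Utf16LeBom;
    if (data.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF) return Guess::Utf16BeBom;
    if (isValidUtf8(data)) return Guess::ValidUtf8;
    return Guess::Unknown;
}
```

A `ValidUtf8` result is strong evidence when the data contains non-ASCII bytes, but pure ASCII also passes, and `Unknown` only tells you that the file is not UTF-8; it says nothing about which other encoding it is.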

There are MANY encodings used in the world (Unicode was designed to replace them all, but that goal has not been achieved 100% globally yet), and many non-ASCII encodings cannot accurately be detected without context, or prior knowledge of the encoding that was used to create the file. Unless you can ask the user for the specific encoding, you will have to resort to heuristic analysis of the data (and there are some 3rd party charset detection libraries if you search around), and that is error-prone without context information.

See this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Without context, the same data can be interpreted in different ways, producing different results. For example, such an issue affects something as "simple" as Notepad in Windows when a file's encoding has to be guessed. This is a good example of how guessing can go wrong:

Notepad bug? Encoding issue?

Some files come up strange in Notepad

The Notepad file encoding problem, redux

Bush hid the facts
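The last title refers to a well-known case: the 18 ASCII bytes of "Bush hid the facts" also form nine valid UTF-16LE code units, and every one of them lands in the CJK Unified Ideographs block (U+4E00 to U+9FFF), so a heuristic can plausibly conclude the text is Chinese. A small sketch to reproduce that reinterpretation (the function names are hypothetical, chosen for this example):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reinterprets raw bytes as little-endian 16-bit code units.
// A trailing odd byte, if any, is ignored.
std::vector<std::uint16_t> asUtf16LeUnits(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

// True if there is at least one unit and every unit falls inside the
// CJK Unified Ideographs block (U+4E00..U+9FFF).
bool allCjkIdeographs(const std::vector<std::uint16_t>& units) {
    if (units.empty()) return false;
    for (std::uint16_t u : units) {
        if (u < 0x4E00 || u > 0x9FFF) return false;
    }
    return true;
}
```

Calling `allCjkIdeographs(asUtf16LeUnits("Bush hid the facts"))` returns `true`: the same bytes are perfectly valid under both readings, and nothing in the data itself says which one was intended.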

No matter how good your heuristics may be, you are still guessing, and guessing is not 100% reliable. So do yourself a favor and don't guess at all.

As for Step #2, once you have determined a source encoding, you should use a portable Unicode library for converting from that encoding to UTF-8, such as libiconv or ICU.
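As a rough sketch of what that looks like with the POSIX `iconv` interface (`iconv_open`, `iconv`, `iconv_close`), assuming the source encoding is already known and that the name you pass is one your iconv implementation supports. The wrapper name `convertToUtf8` is hypothetical. On platforms where iconv lives in a separate libiconv, you may need to link with `-liconv`.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts `input` from `fromEncoding` (e.g. "ISO-8859-1") to UTF-8.
// Throws std::runtime_error if the encoding name is not supported or the
// input is not valid in that encoding.
std::string convertToUtf8(const std::string& input, const char* fromEncoding) {
    iconv_t cd = iconv_open("UTF-8", fromEncoding);
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("unsupported source encoding");

    std::string output;
    char buffer[4096];
    // iconv() takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        std::size_t outLeft = sizeof buffer;
        const std::size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        const int err = errno;  // save before anything else can change it
        output.append(buffer, sizeof buffer - outLeft);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input sequence");
        }
    }

    // Flush any pending shift state (matters for stateful encodings).
    char* out = buffer;
    std::size_t outLeft = sizeof buffer;
    iconv(cd, nullptr, nullptr, &out, &outLeft);
    output.append(buffer, sizeof buffer - outLeft);

    iconv_close(cd);
    return output;
}
```

For example, `convertToUtf8("caf\xE9", "ISO-8859-1")` should yield the bytes `63 61 66 C3 A9`. ICU offers equivalent functionality with its own API.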

Remy Lebeau

There's nothing about a particular file that specifies its encoding, in a universal manner that's applicable to every operating system in the world.

Individual operating systems may provide file-specific metadata that describes what kind of content is in the file, such as which encoding a text file uses.

But there's nothing in the standard C++ library that returns an arbitrary file's encoding.

Sam Varshavchik