How to convert text from CP437 encoding to UTF-8 encoding?

Question

In Windows, the value of the Unicode character ö (Latin small letter o with diaeresis) in the CP437 character set is 148.

In Linux, the byte value for ö in the UTF-8 encoding is:

-61 (Hi Byte)
-74 (Lo Byte)
(unsigned value = 46787)

My question is: how can I convert the value 148 from CP437 to UTF-8 in C++ on Linux?

The detailed info for my problem lies here:

open() function in Linux with extended characters (128-255) returns -1 error

Temporary solution: C++11 supports the conversion to UTF-8 using codecvt_utf8

The `CP437` character set of `ö` is 148, which has nothing to do with Unicode. The unicode value for `ö` is 246, which has nothing to do with CP437. Linux has nothing to do with any of this. UTF8 is an encoding, not a locale. — Mooing Duck, Oct 25 '17 at 20:40
The question is: "how to convert text from CP437 encoding to UTF8 encoding?" — Mooing Duck, Oct 25 '17 at 20:42
Interpreting UTF-8 as little-endian 16-bit integers is also not a thing anyone should do. — Josh Lee, Oct 25 '17 at 20:49

Remy Lebeau · Accepted Answer · 2017-10-25T21:24:19.650

5

On Windows, you can use the Win32 MultiByteToWideChar() function to convert data from CP437 to UTF-16, and then use the WideCharToMultiByte() function to convert data from UTF-16 to UTF-8.

On Linux, you can use a Unicode conversion library, like libiconv or ICU (which are available for Windows, too).
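As a rough illustration of the libiconv route, here is a minimal sketch using the POSIX `iconv_open()`/`iconv()`/`iconv_close()` calls. The helper name `cp437_to_utf8_iconv` is made up for this example, and it assumes the installed iconv implementation recognizes the encoding name "CP437" (glibc does; other implementations may use a different alias).

```cpp
#include <iconv.h>     // POSIX iconv(3); part of glibc on most Linux systems
#include <stdexcept>
#include <string>

// Hypothetical helper (the name is an assumption, not a library function):
// converts a CP437-encoded byte string to UTF-8 using iconv.
std::string cp437_to_utf8_iconv(const std::string& in)
{
    // iconv_open takes the target encoding first, then the source encoding.
    iconv_t cd = iconv_open("UTF-8", "CP437");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed (is CP437 supported here?)");

    // Each CP437 byte becomes at most 3 UTF-8 bytes.
    std::string out(in.size() * 3 + 1, '\0');

    char* src = const_cast<char*>(in.data());  // iconv does not modify the input bytes
    size_t src_left = in.size();
    char* dst = &out[0];
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - dst_left);  // drop the unused tail of the buffer
    return out;
}
```

With this sketch, `cp437_to_utf8_iconv("\x94")` should yield the two bytes 0xC3 0xB6, the UTF-8 encoding of ö. On systems where iconv is a separate library you may need to link with `-liconv`.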

In C++11 and later, you can use std::wstring_convert to:

1. convert from CP437 to either UTF-16 or UTF-32/UCS-4 (if you can get/make a codecvt for CP437, that is).
2. then, convert from UTF-16 or UTF-32/UCS-4 to UTF-8.

You can't use codecvt_utf8 to convert from CP437 to UTF-8 directly. It only supports conversions between:

- UTF-8 and UCS-2 (not UTF-16!)
- UTF-8 and UTF-32/UCS-4.

You have to use codecvt_utf8_utf16 for conversions between UTF-8 and UTF-16.
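To make the second step of the `std::wstring_convert` route concrete, here is a minimal sketch that turns UTF-32 code points into UTF-8 with `std::codecvt_utf8<char32_t>`. The helper name `utf32_to_utf8` is made up for this example. It does not cover the first step (CP437 to UTF-32), which still needs a CP437 codecvt or a lookup table. Note that `<codecvt>` and `std::wstring_convert` are deprecated since C++17, so compilers may warn.

```cpp
#include <codecvt>   // deprecated since C++17, but still shipped by common standard libraries
#include <locale>    // std::wstring_convert
#include <string>

// Hypothetical helper (the name is an assumption): encodes UTF-32 code points as UTF-8.
std::string utf32_to_utf8(const std::u32string& in)
{
    // codecvt_utf8<char32_t> converts between UTF-8 and UTF-32/UCS-4.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}
```

For example, passing a string holding the single code point U+00F6 (ö) should produce the bytes 0xC3 0xB6.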

Or, you can use mbrtoc16() to convert CP437 to UTF-16 using a CP437 locale, and then use c16rtomb() to convert UTF-16 to UTF-8 using a UTF-8 locale (if your standard library implements the fix for DR488; otherwise c16rtomb() only supports UCS-2, not UTF-16).

Otherwise, just create your own CP437-to-UTF8 lookup table for the 256 possible CP437 bytes, and then do the conversion manually, one byte at a time.
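A minimal sketch of the lookup-table approach follows. The names `kCp437High`, `append_utf8`, and `cp437_to_utf8` are made up for this example. It assumes bytes 0x00..0x7F are passed through as ASCII (so the CP437 glyphs for control codes 0x01..0x1F are not translated), and the table values are worth checking against the Unicode CP437.TXT mapping before relying on them.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP437 bytes 0x80..0xFF (assumed table; verify against CP437.TXT).
static const std::uint16_t kCp437High[128] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7, // 0x80
    0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5, // 0x88
    0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9, // 0x90
    0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192, // 0x98
    0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA, // 0xA0
    0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB, // 0xA8
    0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556, // 0xB0
    0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510, // 0xB8
    0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F, // 0xC0
    0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567, // 0xC8
    0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B, // 0xD0
    0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580, // 0xD8
    0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4, // 0xE0
    0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229, // 0xE8
    0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248, // 0xF0
    0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0  // 0xF8
};

// Appends the UTF-8 encoding of a code point below U+10000
// (every value in the table above is in that range).
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a CP437 byte string to UTF-8, one byte at a time.
std::string cp437_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 3);
    for (unsigned char b : in)
        append_utf8(out, b < 0x80 ? b : kCp437High[b - 0x80]);
    return out;
}
```

For the byte from the question, `cp437_to_utf8("\x94")` maps 148 to U+00F6 and emits the bytes 0xC3 0xB6. This version has no external dependencies, which is the main reason to prefer it over a library when only CP437 is needed.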

edited Oct 25 '17 at 21:24

answered Oct 25 '17 at 21:01

Remy Lebeau


That's actually good information. Could you please share more about how to use these conversion libraries? That would be helpful. – adam Oct 25 '17 at 21:10
Read their respective documentation and examples. – Remy Lebeau Oct 25 '17 at 21:24
@adam if you want to roll your own the [Wikipedia page for CP437](https://en.wikipedia.org/wiki/Code_page_437) contains the code points for all 256 characters. Encoding those into UTF-8 is easy, see https://stackoverflow.com/a/148766/5987 for just one example. – Mark Ransom Oct 25 '17 at 21:42

Here's a better link for the conversion table: ftp://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT – Mark Ransom Oct 25 '17 at 22:11

score 2 · Answer 2 · answered Feb 13 '20 at 22:31


This is not C++, but you can also convert a file from the shell with the iconv command-line tool:

$ iconv -f CP437 -t UTF-8 input_file_name.txt -o output_file_name.txt

answered Feb 13 '20 at 22:31

Ralph Bisschops


Note that this does not correctly translate control characters. – fuz Jul 19 '22 at 19:47

score -1 · Answer 3 · answered Oct 26 '17 at 18:45


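I found this solution to convert CP437 to UTF-8. It works perfectly on Linux:

        // Encodes a Unicode code point in U+0080..U+07FF as two UTF-8 bytes.
        // sCMResult.wChar must already hold a Unicode code point, not a raw CP437 byte.
        BYTE lead, trail;
        WORD result;
        if (sCMResult.wChar >= 0x80 && sCMResult.wChar <= 0x7ff)
        {
            lead = (0xc0 | ((sCMResult.wChar >> 6) & 0x1f));  // 110xxxxx: top 5 bits
            trail = (0x80 | (sCMResult.wChar & 0x3f));        // 10xxxxxx: low 6 bits
            // Lead byte goes in the low-order byte, so it is stored first on little-endian machines.
            result = lead | (trail << 8);
        }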

Full post can be found here

answered Oct 26 '17 at 18:45

adam


No, this is only half of the solution. You still need something to convert the bytes of CP437 to Unicode codepoints. If you don't you'll get valid UTF-8 output, but it will be the wrong characters. – Mark Ransom Oct 26 '17 at 18:55
