In Windows, the value of the Unicode character ö (Latin small letter o with diaeresis) in the CP437 character set is 148.

In Linux, the byte value for ö in the UTF-8 encoding is:

-61(Hi Byte) 
-74(Lo Byte)
(unsigned value = 46787)

My question is: how can I convert the CP437 byte value 148 to UTF-8 in C++ on Linux?

The detailed info for my problem lies here:

open() function in Linux with extended characters (128-255) returns -1 error

Temporary solution: C++11 supports the conversion to UTF-8 using codecvt_utf8

Nicol Bolas
adam
  • The `CP437` character set of `ö` is 148, which has nothing to do with Unicode. The Unicode value for `ö` is 246, which has nothing to do with CP437. Linux has nothing to do with any of this. UTF8 is an encoding, not a locale. – Mooing Duck Oct 25 '17 at 20:40
  • The question is: "how to convert text from CP437 encoding to UTF8 encoding?" – Mooing Duck Oct 25 '17 at 20:42
  • @MooingDuck Thanks. Any suggestions on how to do this? – adam Oct 25 '17 at 20:45
  • Interpreting UTF-8 as little-endian 16-bit integers is also not a thing anyone should do. – Josh Lee Oct 25 '17 at 20:49

3 Answers

5

On Windows, you can use the Win32 MultiByteToWideChar() function to convert data from CP437 to UTF-16, and then use the WideCharToMultiByte() function to convert data from UTF-16 to UTF-8.

On Linux, you can use a Unicode conversion library, like libiconv or ICU (which are available for Windows, too).
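As a rough sketch of the iconv route, the following uses the POSIX iconv(3) interface that glibc ships on Linux. The function name cp437_to_utf8_iconv is made up for illustration, and whether the "CP437" converter name is available depends on the iconv implementation and installed conversion modules, so the helper reports failure instead of assuming it exists.

```cpp
#include <iconv.h>   // POSIX iconv(3); provided by glibc on Linux
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (name is illustrative): converts CP437 bytes to UTF-8.
// Returns false if the converter is unavailable or the input cannot be converted.
bool cp437_to_utf8_iconv(const std::string& in, std::string& out)
{
    iconv_t cd = iconv_open("UTF-8", "CP437");   // arguments: to-code, from-code
    if (cd == (iconv_t)-1)
        return false;                            // this iconv has no CP437 support

    out.clear();
    std::string src = in;                        // iconv() takes a non-const char**
    char* inptr = &src[0];
    std::size_t inleft = src.size();
    char buf[256];
    bool ok = true;

    while (inleft > 0) {
        char* outptr = buf;
        std::size_t outleft = sizeof buf;
        std::size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
        int err = errno;                         // save before any other call
        out.append(buf, sizeof buf - outleft);   // keep whatever was converted
        if (rc == (std::size_t)-1 && err != E2BIG) {
            ok = false;                          // E2BIG only means "buffer full"
            break;
        }
    }
    iconv_close(cd);
    return ok;
}
```

If the converter is present, passing the single byte 148 (0x94) should yield the two UTF-8 bytes 0xC3 0xB6 for ö.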


In C++11 and later, you can use std::wstring_convert to:

  • convert from CP437 to either UTF-16 or UTF-32/UCS-4 (if you can get/make a codecvt for CP437, that is).

  • then, convert from UTF-16 or UTF-32/UCS-4 to UTF-8.

You can't use codecvt_utf8 to convert from CP437 to UTF-8 directly. It only supports conversions between:

  • UTF-8 and UCS-2 (not UTF-16!)

  • UTF-8 and UTF-32/UCS-4.

You have to use codecvt_utf8_utf16 for conversions between UTF-8 and UTF-16.
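To illustrate only the second step (UTF-32 to UTF-8), here is a minimal sketch using std::wstring_convert with codecvt_utf8<char32_t>. The helper name utf32_to_utf8 is hypothetical. Both facilities are deprecated since C++17, so a compiler may warn. Getting from CP437 bytes to code points still needs a separate mapping.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: encodes a sequence of UTF-32 code points as UTF-8.
// to_bytes() throws std::range_error if a code point cannot be encoded.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}
```

For example, the code point U+00F6 (ö) comes out as the bytes 0xC3 0xB6.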

Or, you can use mbrtoc16() to convert CP437 to UTF-16 using a CP437 locale, and then use c16rtomb() to convert UTF-16 to UTF-8 using a UTF-8 locale (provided your standard library implements the fix for DR488; otherwise c16rtomb() only supports UCS-2, not UTF-16).


Otherwise, just create your own CP437-to-UTF8 lookup table for the 256 possible CP437 bytes, and then do the conversion manually, one byte at a time.
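A sketch of the lookup-table approach might look like the following. The names kCp437High, append_utf8 and cp437_to_utf8 are invented for illustration. The table was transcribed by hand from the Unicode Consortium's CP437.TXT mapping, so it should be checked against that file before being relied on. Bytes below 0x80 are passed through as ASCII, which means 0x00-0x1F stay control codes and do not become the CP437 graphic glyphs.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for CP437 bytes 0x80..0xFF (assumed to match CP437.TXT;
// verify against that file before use).
static const std::uint16_t kCp437High[128] = {
    0x00C7,0x00FC,0x00E9,0x00E2,0x00E4,0x00E0,0x00E5,0x00E7, // 0x80
    0x00EA,0x00EB,0x00E8,0x00EF,0x00EE,0x00EC,0x00C4,0x00C5, // 0x88
    0x00C9,0x00E6,0x00C6,0x00F4,0x00F6,0x00F2,0x00FB,0x00F9, // 0x90
    0x00FF,0x00D6,0x00DC,0x00A2,0x00A3,0x00A5,0x20A7,0x0192, // 0x98
    0x00E1,0x00ED,0x00F3,0x00FA,0x00F1,0x00D1,0x00AA,0x00BA, // 0xA0
    0x00BF,0x2310,0x00AC,0x00BD,0x00BC,0x00A1,0x00AB,0x00BB, // 0xA8
    0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556, // 0xB0
    0x2555,0x2563,0x2551,0x2557,0x255D,0x255C,0x255B,0x2510, // 0xB8
    0x2514,0x2534,0x252C,0x251C,0x2500,0x253C,0x255E,0x255F, // 0xC0
    0x255A,0x2554,0x2569,0x2566,0x2560,0x2550,0x256C,0x2567, // 0xC8
    0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256B, // 0xD0
    0x256A,0x2518,0x250C,0x2588,0x2584,0x258C,0x2590,0x2580, // 0xD8
    0x03B1,0x00DF,0x0393,0x03C0,0x03A3,0x03C3,0x00B5,0x03C4, // 0xE0
    0x03A6,0x0398,0x03A9,0x03B4,0x221E,0x03C6,0x03B5,0x2229, // 0xE8
    0x2261,0x00B1,0x2265,0x2264,0x2320,0x2321,0x00F7,0x2248, // 0xF0
    0x00B0,0x2219,0x00B7,0x221A,0x207F,0x00B2,0x25A0,0x00A0  // 0xF8
};

// Appends the UTF-8 encoding of a code point up to U+FFFF to out.
// (All CP437 characters fall in this range, so 1 to 3 bytes suffice.)
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
}

// Converts CP437 bytes to UTF-8, one input byte at a time.
std::string cp437_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in)
        append_utf8(out, b < 0x80 ? b : kCp437High[b - 0x80]);
    return out;
}
```

With this table, byte 148 (0x94) maps to U+00F6 and is emitted as 0xC3 0xB6, the same two bytes the question shows as -61 and -74 when read as signed chars.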

Remy Lebeau
  • That's actually good information. Could you please share more about how to use these conversion libraries? That would be helpful. – adam Oct 25 '17 at 21:10
  • Read their respective documentation and examples. – Remy Lebeau Oct 25 '17 at 21:24
  • @adam if you want to roll your own the [Wikipedia page for CP437](https://en.wikipedia.org/wiki/Code_page_437) contains the code points for all 256 characters. Encoding those into UTF-8 is easy, see https://stackoverflow.com/a/148766/5987 for just one example. – Mark Ransom Oct 25 '17 at 21:42
  • Here's a better link for the conversion table: ftp://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT – Mark Ransom Oct 25 '17 at 22:11
2

This is not C++, but you can also convert a file from the shell with the iconv command-line tool:

$ iconv -f CP437 -t UTF-8 input_file_name.txt -o output_file_name.txt
Ralph Bisschops
-1

I found this solution for converting CP437 to UTF-8. It works perfectly on Linux.

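        // Encodes a Unicode code point in U+0080..U+07FF as two UTF-8 bytes.
        // wChar must already be a Unicode code point, not a raw CP437 byte.
        BYTE lead, trail;
        WORD result;
        if (sCMResult.wChar >= 0x80 && sCMResult.wChar <= 0x7ff)
        {
            lead = (0xc0 | ((sCMResult.wChar >> 6) & 0x1f));   // 110xxxxx
            trail = (0x80 | (sCMResult.wChar & 0x3f));         // 10xxxxxx
            // Packed so that a little-endian store writes the lead byte first.
            result = lead | (trail << 8);
        }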

Full post can be found here

adam
  • No, this is only half of the solution. You still need something to convert the bytes of CP437 to Unicode codepoints. If you don't you'll get valid UTF-8 output, but it will be the wrong characters. – Mark Ransom Oct 26 '17 at 18:55