
I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.

I would definitely prefer a completely string-based C++ solution; then I could just call .c_str() on the resulting string.

How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.

Chris Redford
  • That sounds like a potentially big project - and exactly the kind of thing a library like iconv is good for. What's wrong with doing it the right way? – Carl Norum May 15 '14 at 22:35
  • I'm fine using iconv if that's the only available way to do it. It definitely isn't the most elegant C++ solution imaginable. Something like `s.toEncoding("ISO-8859-1")` would be much more elegant. My point is, even if I'm doing it in iconv, it isn't clear to me how to use the library with `string` input (a sketch follows these comments). – Chris Redford May 15 '14 at 22:58
  • Not sure, but maybe it can help: http://www.openldap.org/lists/openldap-devel/200304/msg00123.html – gerbit May 15 '14 at 23:00
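
Since the comments above ask how to drive iconv from a std::string, here is a minimal, lightly error-checked sketch. The helper name utf8_to_latin1_iconv is my own, not part of any library; error handling is deliberately minimal.

#include <iconv.h>
#include <string>
#include <stdexcept>

// Hypothetical helper: convert a UTF-8 std::string to ISO-8859-1 with iconv.
std::string utf8_to_latin1_iconv(const std::string& in)
{
    std::string out;
    if (in.empty())
        return out;

    // On glibc, "ISO-8859-1//TRANSLIT" would transliterate unmappable
    // characters instead of failing outright.
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Latin-1 output never needs more bytes than the UTF-8 input.
    out.resize(in.size());
    char* inbuf = const_cast<char*>(in.data());   // POSIX iconv takes char**
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed (unmappable or invalid input)");

    out.resize(out.size() - outleft);             // trim the unused tail
    return out;
}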

3 Answers


I'm going to modify my code from another answer to implement the suggestion from Alf.

std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // ASCII byte, complete by itself
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte: add 6 bits
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // start of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // start of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // start of a 4-byte sequence
        ++in;
        // Emit once the next byte is not a continuation byte,
        // i.e. the current code point is complete.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}

Invalid UTF-8 input results in dropped characters.
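
A possible call site, given the std::string input mentioned in the question (legacyFunction is a hypothetical stand-in for the legacy API):

std::string utf8 = u8"caf\u00e9";                   // "café" encoded as UTF-8
std::string latin1 = UTF8toISO8859_1(utf8.c_str());
legacyFunction(latin1.c_str());                     // hypothetical legacy call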

Mark Ransom
  • Actually, I have UTF-8 `string` coming in. If you made it `string`-to-`string` that would be perfect. – Chris Redford May 15 '14 at 23:21
  • @ChrisRedford, just call it with `mystr.c_str()`. I like having the `const char *` input because it's more flexible. – Mark Ransom May 16 '14 at 00:02
  • Since the input comes from a `std::string`, just replace `const char * in` with `const std::string& in`, and then create a local `char*` variable that is assigned `in.c_str()` for use in the loop, and use `in.size()` as a loop counter instead of `*in != 0`. Or use `in.begin()` and `in.end()` iterators (a sketch of this variant follows these comments). – Remy Lebeau May 19 '14 at 23:30
  • If you're looking for a way to convert a std::string with UTF-8 characters to ISO 8859 or Windows-1252 encoding, here is a function that does that, using a hardcoded conversion with no calls to codecvt_utf8(), iconv(), or similar functions. It uses a loop similar to Mark Ransom's. https://github.com/agnasg/utils – Gustavo Rodríguez Aug 02 '18 at 12:41
  • @GustavoRodríguez that's easy to do because Unicode adopted the Latin-1 character set for its first 256 codepoints - no translation necessary. – Mark Ransom Aug 02 '18 at 12:47
  • Maybe using `char8_t` is better. – Константин Ван Feb 24 '21 at 11:53
  • And note: this converts to ISO_8859-1:1987, a superset of ISO/IEC 8859-1:1998. – Константин Ван Feb 25 '21 at 04:30
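
For reference, here is a minimal sketch of the std::string-to-std::string variant Remy Lebeau describes in the comments above, reusing the same decoding loop with iterators:

#include <string>

std::string UTF8toISO8859_1(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    unsigned int codepoint = 0;
    for (std::string::const_iterator it = in.begin(); it != in.end(); )
    {
        unsigned char ch = static_cast<unsigned char>(*it);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++it;
        // Emit once the sequence is complete (end of input, or the
        // next byte is not a continuation byte).
        if ((it == in.end() || (static_cast<unsigned char>(*it) & 0xc0) != 0x80)
            && codepoint <= 0x10ffff)
        {
            if (codepoint <= 255)
                out.append(1, static_cast<char>(codepoint));
            // else: out-of-range code point, dropped
        }
    }
    return out;
}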

First convert UTF-8 to 32-bit Unicode.

Then keep the values that are in the range 0 through 255.

Those are the Latin-1 code points. For other values, decide whether you want to treat them as an error, or perhaps replace them with code point 127 (my fav, the ASCII "del") or a question mark or something.


The C++ standard library defines a std::codecvt specialization that can be used:

template<>
class codecvt<char32_t, char, mbstate_t>;

C++11 §22.4.1.4/3: “the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
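
As a rough sketch of this recipe, assuming C++11's std::wstring_convert with std::codecvt_utf8<char32_t> as one convenient way to drive a UTF-8 to UTF-32 conversion (both deprecated in C++17, as a comment below notes):

#include <string>
#include <locale>
#include <codecvt>

// Decode UTF-8 to 32-bit code points, then keep 0..255 and
// substitute '?' for everything else.
std::string utf8_to_latin1(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string codepoints = conv.from_bytes(utf8);

    std::string latin1;
    latin1.reserve(codepoints.size());
    for (char32_t cp : codepoints)
        latin1 += (cp <= 255) ? static_cast<char>(cp) : '?';
    return latin1;
}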

Cheers and hth. - Alf
  • This works well, simply because Unicode was defined as a superset of ISO-8859-1 to begin with. See http://en.wikipedia.org/wiki/Unicode#Origin_and_development P.S. as a starting point for the conversion might I suggest http://stackoverflow.com/a/148766/5987 – Mark Ransom May 15 '14 at 23:02
  • But, but, isn’t `std::codecvt` deprecated in C++17? – Константин Ван Feb 24 '21 at 09:36

Alf's suggestion, implemented in C++11:

#include <string>
#include <locale>
#include <codecvt>
#include <algorithm>
#include <iterator>

int main()
{
    auto i = u8"H€llo Wørld";
    // Decode the UTF-8 bytes to a wide string. Note: wchar_t is only
    // 16 bits wide on Windows, and <codecvt> is deprecated as of C++17.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
    auto wide = utf8.from_bytes(i);

    // Keep code points 0..255 (Latin-1) and substitute '?' for the rest.
    std::string out;
    out.reserve(wide.length());
    std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
                   [](wchar_t c) { return static_cast<char>((c <= 255) ? c : '?'); });
    // out now contains "H?llo W\xf8rld"
}
cypres