
I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.

I would definitely prefer a completely string-based C++ solution; then I could just call .c_str() on the resulting string.

How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.

Chris Redford
  • That sounds like a potentially big project - and exactly the kind of thing a library like iconv is good for. What's wrong with doing it the right way? – Carl Norum May 15 '14 at 22:35
  • I'm fine using iconv if that's the only available way to do it. It definitely isn't the most elegant C++ solution imaginable. Something like `s.toEncoding("ISO-8859-1")` would be much more elegant. My point is, even if I'm doing it in iconv, it isn't clear to me how to use the library with `string` input (a sketch follows these comments). – Chris Redford May 15 '14 at 22:58
  • Not sure, but maybe it can help: http://www.openldap.org/lists/openldap-devel/200304/msg00123.html – gerbit May 15 '14 at 23:00
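
Since the comments above ask how to drive iconv from a std::string, here is a minimal, lightly error-checked sketch. The helper name utf8_to_latin1_iconv is my own, not part of any library; error handling is deliberately minimal.

#include <iconv.h>
#include <string>
#include <stdexcept>

// Hypothetical helper: convert a UTF-8 std::string to ISO-8859-1 with iconv.
std::string utf8_to_latin1_iconv(const std::string& in)
{
    std::string out;
    if (in.empty())
        return out;

    // On glibc, "ISO-8859-1//TRANSLIT" would transliterate unmappable
    // characters instead of failing outright.
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Latin-1 output never needs more bytes than the UTF-8 input.
    out.resize(in.size());
    char* inbuf = const_cast<char*>(in.data());   // POSIX iconv takes char**
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed (unmappable or invalid input)");

    out.resize(out.size() - outleft);             // trim the unused tail
    return out;
}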

3 Answers


I'm going to modify my code from another answer to implement the suggestion from Alf.

std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // ASCII byte, complete by itself
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte: add 6 bits
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // start of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // start of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // start of a 4-byte sequence
        ++in;
        // Emit once the next byte is not a continuation byte,
        // i.e. the current code point is complete.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}

Invalid UTF-8 input results in dropped characters.
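
A possible call site, given the std::string input mentioned in the question (legacyFunction is a hypothetical stand-in for the legacy API):

std::string utf8 = u8"caf\u00e9";                   // "café" encoded as UTF-8
std::string latin1 = UTF8toISO8859_1(utf8.c_str());
legacyFunction(latin1.c_str());                     // hypothetical legacy call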

Mark Ransom
  • Actually, I have UTF-8 `string` coming in. If you made it `string`-to-`string` that would be perfect. – Chris Redford May 15 '14 at 23:21
  • @ChrisRedford, just call it with `mystr.c_str()`. I like having the `const char *` input because it's more flexible. – Mark Ransom May 16 '14 at 00:02
  • Since the input comes from a `std::string`, just replace `const char * in` with `const std::string& in`, and then create a local `char*` variable that is assigned `in.c_str()` for use in the loop, and use `in.size()` as a loop counter instead of `*in != 0`. Or use `in.begin()` and `in.end()` iterators (a sketch of this variant follows these comments). – Remy Lebeau May 19 '14 at 23:30
  • If you're looking for a way to convert a std::string with UTF-8 characters to ISO 8859 or Windows-1252 encoding, here is a function that does that, using a hardcoded conversion with no calls to codecvt_utf8(), iconv(), or similar functions. It uses a loop similar to Mark Ransom's. https://github.com/agnasg/utils – Gustavo Rodríguez Aug 02 '18 at 12:41
  • @GustavoRodríguez that's easy to do because Unicode adopted the Latin-1 character set for its first 256 codepoints - no translation necessary. – Mark Ransom Aug 02 '18 at 12:47
  • Maybe using `char8_t` is better. – Константин Ван Feb 24 '21 at 11:53
  • And note: this converts to ISO_8859-1:1987, a superset of ISO/IEC 8859-1:1998. – Константин Ван Feb 25 '21 at 04:30
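
For reference, here is a minimal sketch of the std::string-to-std::string variant Remy Lebeau describes in the comments above, reusing the same decoding loop with iterators:

#include <string>

std::string UTF8toISO8859_1(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    unsigned int codepoint = 0;
    for (std::string::const_iterator it = in.begin(); it != in.end(); )
    {
        unsigned char ch = static_cast<unsigned char>(*it);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++it;
        // Emit once the sequence is complete (end of input, or the
        // next byte is not a continuation byte).
        if ((it == in.end() || (static_cast<unsigned char>(*it) & 0xc0) != 0x80)
            && codepoint <= 0x10ffff)
        {
            if (codepoint <= 255)
                out.append(1, static_cast<char>(codepoint));
            // else: out-of-range code point, dropped
        }
    }
    return out;
}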

First convert UTF-8 to 32-bit Unicode.

Then keep the values that are in the range 0 through 255.

Those are the Latin-1 code points. For other values, decide whether you want to treat them as an error, or perhaps replace them with code point 127 (my fav, the ASCII "del") or a question mark or something.


The C++ standard library defines a std::codecvt specialization that can be used:

template<>
class codecvt<char32_t, char, mbstate_t>;

C++11 §22.4.1.4/3: “the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
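
As a rough sketch of this recipe, assuming C++11's std::wstring_convert with std::codecvt_utf8<char32_t> as one convenient way to drive a UTF-8 to UTF-32 conversion (both deprecated in C++17, as a comment below notes):

#include <string>
#include <locale>
#include <codecvt>

// Decode UTF-8 to 32-bit code points, then keep 0..255 and
// substitute '?' for everything else.
std::string utf8_to_latin1(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string codepoints = conv.from_bytes(utf8);

    std::string latin1;
    latin1.reserve(codepoints.size());
    for (char32_t cp : codepoints)
        latin1 += (cp <= 255) ? static_cast<char>(cp) : '?';
    return latin1;
}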

Cheers and hth. - Alf
  • This works well, simply because Unicode was defined as a superset of ISO-8859-1 to begin with. See http://en.wikipedia.org/wiki/Unicode#Origin_and_development P.S. as a starting point for the conversion might I suggest http://stackoverflow.com/a/148766/5987 – Mark Ransom May 15 '14 at 23:02
  • But, but, isn’t `std::codecvt` deprecated in C++17? – Константин Ван Feb 24 '21 at 09:36

Alf's suggestion, implemented in C++11:

#include <string>
#include <locale>
#include <codecvt>
#include <algorithm>
#include <iterator>

int main()
{
    auto i = u8"H€llo Wørld";
    // Decode the UTF-8 bytes to a wide string. Note: wchar_t is only
    // 16 bits wide on Windows, and <codecvt> is deprecated as of C++17.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
    auto wide = utf8.from_bytes(i);

    // Keep code points 0..255 (Latin-1) and substitute '?' for the rest.
    std::string out;
    out.reserve(wide.length());
    std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
                   [](wchar_t c) { return static_cast<char>((c <= 255) ? c : '?'); });
    // out now contains "H?llo W\xf8rld"
}
cypres