25

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 coding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

Machavity
gordonwd
  • 4
    There's nothing simple about it. You could use the open source ICU library. – Hans Passant Oct 30 '10 at 17:23
  • 3
    If you have to do it, then the simplest code is to pre-generate a table of the 128 (or so) UTF-8 characters corresponding to the 8859-1 characters with the top bit set. The other 128 8859-1 characters are unmodified. That way, your code doesn't have to understand Unicode at all. Also, beware the difference between ISO-8859-1 and Windows CP-1252. The latter has some extra characters in it where 8859-1 has gaps (unused code points). Unless you're supposed to be validating that your input really is ISO-8859-1, there's no point not accepting CP-1252, because you *will* see it mislabelled. – Steve Jessop Oct 30 '10 at 17:30
  • @Steve: since UTF-8 is variable length (in this case, 1 or 2 bytes per character), a lookup table is not so easy to use. See my answer which should be just as fast and a lot simpler. – R.. GitHub STOP HELPING ICE Oct 30 '10 at 17:54
  • @R.: well, "easy" is a relative term. `stpcpy` helps, provided you're the kind of programmer who's good with buffer sizes. – Steve Jessop Oct 30 '10 at 18:48
  • `stpcpy` (even if it is standard or headed towards being standard now..?) is a helluvalot of overhead for 1- and 2-byte copies. You'd be better off just always copying 2 bytes (by hand) and including some code to skip the second pointer advance if the byte copied was 0 (which can almost surely be branchless). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 16:48

7 Answers

40

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

unsigned char *in, *out;
while (*in)
    if (*in<128) *out++=*in++;
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;

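For safety you need to ensure that the output buffer is at least twice as large as the input buffer (every byte of 128 or above becomes two bytes), or else include a size limit and check it in the loop condition. Note that the loop stops at the input's terminating NUL without copying it, so terminate the output yourself.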
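A minimal sketch of the size-limited variant mentioned above, assuming a NUL-terminated input. The function name `latin1_to_utf8` and its return convention are hypothetical and not part of the original answer:

```c
#include <stddef.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-1 string to UTF-8.
 * Writes at most out_size bytes, including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return (size_t)-1;

    while (*in) {
        if (*in < 0x80) {
            /* ASCII: one byte, copied unchanged (keep room for the NUL). */
            if (n + 2 > out_size)
                return (size_t)-1;
            out[n++] = *in++;
        } else {
            /* 0x80..0xFF: two bytes, 110000xx 10xxxxxx. */
            if (n + 3 > out_size)
                return (size_t)-1;
            out[n++] = 0xC0 | (*in >> 6);
            out[n++] = 0x80 | (*in++ & 0x3F);
        }
    }
    out[n] = '\0';
    return n;
}
```

Sizing the output as `2 * strlen(input) + 1` bytes guarantees the call cannot fail for lack of space.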

R.. GitHub STOP HELPING ICE
  • 3
    Wow. This is very helpful! I wasn't looking forward to yet-another table lookup algorithm. Now for ANSEL-to-UTF-8... – gordonwd Oct 30 '10 at 18:31
  • 11
    This certainly answers the question. But as I said in a comment above, people *will* send you CP-1252 mislabelled as ISO-8859-1. Web servers are the example that I've tripped over that persuaded me of the problem, but also text editors that claim to be saving as "Latin-1" when they aren't. That "if your source encoding will always be ISO-8859-1" is a pretty big "if", and it might be hard to track down and eliminate the miscreant responsible. – Steve Jessop Oct 30 '10 at 18:46
  • 1
    @Steve: You could add an `else if (*in<192) goto error;` case to error-out on encountering any ISO-8859-1 control codes (which are probably misencoded Windows-1252 characters, and not useful characters anyway). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:36
  • 2
    @gordon: I'm not familiar with ANSEL, but you should be aware that ISO-8859-1 is the **only** legacy encoding that's this easy to convert to UTF-8. Everything else will require lookup tables. As Steve said, my "If.." is a **big** if. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:37
  • 7
    This is quite poorly written code from a maintainability standpoint. Use more braces. – syb0rg Feb 04 '14 at 00:18
  • how would i simplify this to do only 1 character? I'm trying to understand what this code does, and the simplification will help me to understand.. – user230910 Sep 29 '15 at 07:56
  • @MaximEgorushkin Not trying to defend the code, but it does have `,`, which acts as a sequence point. – user694733 Sep 29 '15 at 08:18
  • @user694733 You are right, there is a sequence point at built-in `,` operator. – Maxim Egorushkin Sep 29 '15 at 08:25
  • @R.. < 192 "not useful"? 163, the £ sign, is useful for us folks in the UK. I think you meant 160 rather than 192. – Nick Jun 16 '19 at 15:59
  • 2
    @Nick: Yep, I meant 0xA0 and just converted to decimal in my head incorrectly. Comment is way too old to edit though. – R.. GitHub STOP HELPING ICE Jun 16 '19 at 20:06
18

For C++ I use this:

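#include <cstdint>
#include <string>

// Converts an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            // ASCII: copied unchanged.
            strOut.push_back(ch);
        }
        else {
            // 0x80..0xFF: two bytes, 110000xx 10xxxxxx.
            strOut.push_back(0xc0 | (ch >> 6));
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}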
Lord Raiden
5

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");
jpo38
Spacemoose
3

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux (and other POSIX systems) or MultiByteToWideChar() & Co. on Windows. The open source ICU library provides extensive support for string conversion.
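As a rough sketch of the iconv() route, assuming a POSIX system whose iconv implementation knows the names "ISO-8859-1" and "UTF-8" (glibc does). The wrapper name `latin1_to_utf8_iconv` and its return convention are made up for illustration:

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper around POSIX iconv: converts in_len bytes of
 * ISO-8859-1 text to UTF-8. Returns the number of bytes written to out,
 * or (size_t)-1 on failure (unsupported charset, output too small, ...).
 * The output is not NUL-terminated. */
size_t latin1_to_utf8_iconv(const char *in, size_t in_len,
                            char *out, size_t out_size)
{
    /* Note the argument order: destination charset first, then source. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv takes char ** for the input but only reads the data. */
    char *in_ptr = (char *)in;
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;
}
```

On some platforms (macOS, for example) you may need to link with `-liconv`.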

Cheers and hth. - Alf
cytrinox
2

The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
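A sketch of what such a lookup table could look like when written out by hand in C. Only bytes 0x80 to 0x9F need the table, since every other CP1252 byte equals its Unicode code point. The names `cp1252_high` and `cp1252_byte_to_utf8` are hypothetical, and treating the five unassigned bytes as errors is an assumption (some converters map them to C1 control codes instead):

```c
#include <stddef.h>

/* Unicode code points for CP1252 bytes 0x80..0x9F.
 * 0 marks the five bytes that CP1252 leaves unassigned. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: encodes one CP1252 byte as UTF-8 into out.
 * Returns the number of bytes written (1 to 3), or 0 if the byte is
 * unassigned in CP1252. */
size_t cp1252_byte_to_utf8(unsigned char c, unsigned char out[3])
{
    unsigned int cp = c;

    if (c >= 0x80 && c <= 0x9F) {
        cp = cp1252_high[c - 0x80];
        if (cp == 0)
            return 0;               /* unassigned byte */
    }

    if (cp < 0x80) {                /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {               /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For example, byte 0x80 (the euro sign in CP1252) comes out as the three bytes E2 82 AC, while 0xE9 is encoded as C3 A9 exactly as it would be for ISO-8859-1.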

RBerteig
-1

The code

int isolat1ToUTF8(unsigned char* out, int *outlen,
                  const unsigned char* in, int *inlen) {
    unsigned char* outstart = out;
    const unsigned char* base = in;
    const unsigned char* processed = in;
    unsigned char* outend = out + *outlen;
    const unsigned char* inend;
    unsigned int c;
    int bits;

    inend = in + (*inlen);
    while ((in < inend) && (out - outstart + 5 < *outlen)) {
        c = *in++;

        /* assertion: c is a single UCS-4 value */
        if (out >= outend)
            break;
        if (c < 0x80) { *out++ = c;                         bits = -6; }
        else          { *out++ = ((c >> 6) & 0x1F) | 0xC0;  bits =  0; }

        /* emit the continuation byte (runs once when c >= 0x80) */
        for ( ; bits >= 0; bits -= 6) {
            if (out >= outend)
                break;
            *out++ = ((c >> bits) & 0x3F) | 0x80;
        }
        processed = (const unsigned char*) in;
    }
    *outlen = out - outstart;    /* bytes written */
    *inlen = processed - base;   /* bytes consumed */
    return(0);
}

I think this could be helpful! And sorry for my last comment, which was deleted. I got this from a .c file that contains a full explanation; I can give you the link if needed. Cheers!

  • A link to an image of code does not meet the standards for a Stackoverflow answer. The link can go bad, and code from images can not be directly copied. – Andrew Henle Jul 24 '22 at 19:58
  • 1
    Hi, and welcome to Stack Overflow! Note that code here should be presented as formatted source code _text_, not image. Please read the help section of the site for more! Also the question is over 12 years old, and while it is good to write up-to-date answers to old questions, your answer seems to contain code very similar to what is already found in some answers under this Q. – hyde Jul 24 '22 at 19:58
  • @AndrewHenle Hey! Is better so? – o0Evolved0o Jul 24 '22 at 20:05
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](https://stackoverflow.com/review/late-answers/32324127) – Nol4635 Jul 28 '22 at 00:26
-2

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.
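For reference, the encoding algorithm can be sketched in a few lines of C. This is an illustrative helper rather than code from any particular library; the name `utf8_encode` and its return convention are assumptions:

```c
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not valid in UTF-8 */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                            /* out of range */
}
```

For ISO-8859-1 input, each byte value (read as `unsigned char`) is already the code point, so only the first two branches are ever taken.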

Cheers & hth.,

Cheers and hth. - Alf
  • The algorithm is not entirely trivial, especially when novice to intermediate C coders often mistakenly use `char *` where `unsigned char *` is needed. More significant nontrivialities are in the definition of UTF-8, specifically that you need to reject surrogate codepoints and out-of-range values. Thankfully those won't come up in an encoder that only needs to handle ISO-8859-1 input, but if you write such a limited encoder it's likely someone will end up misusing it for a larger input range later without adding any checks. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:40
  • @MichałLeon: Unicode is not an encoding. There are a number of different encodings of Unicode, including UTF-8 and UTF-16. The first 256 code points of Unicode are the same as Latin 1 (a.k.a. ISO-8859-1). Note: emphasis doesn't make you less at odds with trivial fact. Next time, instead of shouting and downvoting, consider simply checking facts, or just ask about anything you don't understand. – Cheers and hth. - Alf Jan 23 '18 at 17:23
  • @Martin: The block of Unicode code points 128 through 255 is called the ["Latin-1 supplement" of Unicode](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)), because it's the same as Latin-1. Unicode is a direct extension of Latin-1. Your comments are absurd nonsense, the kind of techno-babble that can influence non-technical people and indicates trolling. I presume you're trolling. – Cheers and hth. - Alf Jan 24 '18 at 10:59
  • @MichałLeon: OK, sorry. I should maybe have guessed: I have for many years helped a student with extremely bad eye-sight, and she routinely fails to see what's right there. Latin-1 is specified in the OP's posting, in my answer, in all my comments, and in the other answers except one. – Cheers and hth. - Alf Jan 24 '18 at 13:50