Convert ISO-8859-1 strings to UTF-8 in C/C++

Question

You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data that is in 8-bit ISO-8859-1 encoding, but I need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

There's nothing simple about it. You could use the open source ICU library. — Hans Passant, Oct 30 '10 at 17:23
If you have to do it, then the simplest code is to pre-generate a table of the 128 (or so) UTF-8 characters corresponding to the 8859-1 characters with the top bit set. The other 128 8859-1 characters are unmodified. That way, your code doesn't have to understand Unicode at all. Also, beware the difference between ISO-8859-1 and Windows CP-1252. The latter has some extra characters in it where 8859-1 has gaps (unused code points). Unless you're supposed to be validating that your input really is ISO-8859-1, there's no point not accepting CP-1252, because you *will* see it mislabelled. — Steve Jessop, Oct 30 '10 at 17:30
@Steve: since UTF-8 is variable length (in this case, 1 or 2 bytes per character), a lookup table is not so easy to use. See my answer which should be just as fast and a lot simpler. — R.. GitHub STOP HELPING ICE, Oct 30 '10 at 17:54
@R.: well, "easy" is a relative term. `stpcpy` helps, provided you're the kind of programmer who's good with buffer sizes. — Steve Jessop, Oct 30 '10 at 18:48
`stpcpy` (even if it is standard or headed towards being standard now..?) is a helluvalot of overhead for 1- and 2-byte copies. You'd be better off just always copying 2 bytes (by hand) and including some code to skip the second pointer advance if the byte copied was 0 (which can almost surely be branchless). — R.. GitHub STOP HELPING ICE, Oct 31 '10 at 16:48

score 40 · Accepted Answer · answered Oct 30 '10 at 17:53

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

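unsigned char *in, *out;  /* in: NUL-terminated input; out: output buffer (both set up by the caller) */
while (*in)
    if (*in<128) *out++=*in++;  /* ASCII: copy unchanged */
    else *out++=0xc2+(*in>0xbf), *out++=(*in++&0x3f)+0x80;  /* 0x80-0xff: lead byte 0xc2 or 0xc3, then one continuation byte */
*out = 0;  /* terminate the output string */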

For safety you need to ensure that the output buffer is twice as large as the input buffer, or else include a size limit and check it in the loop condition.
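
The size-limited variant mentioned above could look roughly like this. It is only a sketch: the function name `latin1_to_utf8`, its parameters, and the `(size_t)-1` error convention are assumptions made for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts NUL-terminated ISO-8859-1 text in `in`
 * to UTF-8 in `out`, writing at most `out_size` bytes including the
 * terminating NUL. Returns the number of UTF-8 bytes written (not
 * counting the NUL), or (size_t)-1 if the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out,
                      size_t out_size)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in; in++) {
        size_t need = (*in < 0x80) ? 1 : 2;
        /* always keep one byte in reserve for the terminating NUL */
        if (out_size - n <= need)
            return (size_t)-1;
        if (*in < 0x80) {
            out[n++] = *in;                         /* ASCII: unchanged */
        } else {
            out[n++] = 0xC2 + (*in > 0xBF);         /* lead byte C2 or C3 */
            out[n++] = (*in & 0x3F) + 0x80;         /* continuation byte */
        }
    }
    if (n >= out_size)          /* only reachable for an empty input */
        return (size_t)-1;
    out[n] = 0;
    return n;
}
```

An output buffer of `2 * strlen(input) + 1` bytes is always large enough, so the error path only triggers when the caller passes something smaller.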

R.. GitHub STOP HELPING ICE

Wow. This is very helpful! I wasn't looking forward to yet-another table lookup algorithm. Now for ANSEL-to-UTF-8... – gordonwd Oct 30 '10 at 18:31

This certainly answers the question. But as I said in a comment above, people *will* send you CP-1252 mislabelled as ISO-8859-1. Web servers are the example that I've tripped over that persuaded me of the problem, but also text editors that claim to be saving as "Latin-1" when they aren't. That "if your source encoding will always be ISO-8859-1" is a pretty big "if", and it might be hard to track down and eliminate the miscreant responsible. – Steve Jessop Oct 30 '10 at 18:46

@Steve: You could add an `else if (*in<192) goto error;` case to error-out on encountering any ISO-8859-1 control codes (which are probably misencoded Windows-1252 characters, and not useful characters anyway). – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:36

@gordon: I'm not familiar with ANSEL, but you should be aware that ISO-8859-1 is the **only** legacy encoding that's this easy to convert to UTF-8. Everything else will require lookup tables. As Steve said, my "If.." is a **big** if. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:37

This is quite poorly written code from a maintainability standpoint. Use more braces. – syb0rg Feb 04 '14 at 00:18
How would I simplify this to handle only one character? I'm trying to understand what this code does, and the simplification would help me understand it. – user230910 Sep 29 '15 at 07:56
@MaximEgorushkin Not trying to defend the code, but it does have `,`, which acts as a sequence point. – user694733 Sep 29 '15 at 08:18
@user694733 You are right, there is a sequence point at built-in `,` operator. – Maxim Egorushkin Sep 29 '15 at 08:25
@R.. < 192 "not useful"? 163, the £ sign, is useful for us folks in the UK. I think you meant 160 rather than 192. – Nick Jun 16 '19 at 15:59

@Nick: Yep, I meant 0xA0 and just converted to decimal in my head incorrectly. Comment is way too old to edit though. – R.. GitHub STOP HELPING ICE Jun 16 '19 at 20:06
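
For the single-character question and the corrected `0xA0` bound discussed in these comments, a minimal sketch might look like the following. The function name `latin1_char_to_utf8` and the "return 0 on a C1 control code" convention are assumptions for illustration. Whether to reject the range 0x80-0x9F at all depends on whether mislabelled CP-1252 input should be treated as an error.

```c
#include <assert.h>

/* Hypothetical helper: encodes one ISO-8859-1 byte `c` as UTF-8 into
 * `out` (which must have room for 2 bytes). Returns the number of bytes
 * written (1 or 2), or 0 if `c` is a C1 control code (0x80-0x9F), which
 * in practice often indicates mislabelled Windows-1252 input. */
int latin1_char_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {             /* ASCII: one byte, unchanged */
        out[0] = c;
        return 1;
    }
    if (c < 0xA0)               /* C1 controls: report as an error */
        return 0;
    out[0] = 0xC0 | (c >> 6);   /* lead byte: 0xC2 or 0xC3 */
    out[1] = 0x80 | (c & 0x3F); /* continuation byte: low six bits */
    return 2;
}
```

With this check in place, 0xA3 (the pound sign) is still converted, to the two bytes C2 A3, while a byte such as 0x85 is reported as an error.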

score 18 · Answer 2 · answered Oct 05 '16 at 21:37

For C++ I use this:

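#include <cstdint>
#include <string>

// Converts an ISO-8859-1 (Latin-1) string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    strOut.reserve(str.size());  // at least one output byte per input byte
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = static_cast<uint8_t>(*it);
        if (ch < 0x80) {
            // ASCII range: copied unchanged
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80..0xFF: two bytes, lead byte then continuation byte
            strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}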

Lord Raiden

Can you please share the Latin7 version? – Ronalds Mazītis May 27 '21 at 16:58
@RonaldsMazītis As Latin7 has no 1:1 mapping with Unicode, it requires a conversion lookup table; there's no trivial way to do it. – Ale Aug 10 '23 at 14:16

score 5 · Answer 3 · edited Dec 11 '20 at 14:38

You can use the boost::locale library:

http://www.boost.org/doc/libs/1_49_0/libs/locale/doc/html/charset_handling.html

The code would look like this:

#include <boost/locale.hpp>
std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string,"Latin1");

jpo38

answered May 31 '17 at 12:09

Spacemoose

score 3 · Answer 4 · edited Jan 24 '18 at 12:36

The C++03 standard does not provide functions to directly convert between specific charsets.

Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() and related functions on Windows. The open-source ICU library provides broad support for string conversion.
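
As a rough illustration of the iconv() route, here is a sketch using the POSIX `iconv_open`/`iconv`/`iconv_close` calls. The wrapper name `latin1_to_utf8_iconv` and its error convention are assumptions, and the encoding names "ISO-8859-1" and "UTF-8" are the spellings accepted by common iconv implementations. On some platforms you may need to link with `-liconv`.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: converts the NUL-terminated ISO-8859-1 string
 * `in` to UTF-8 in `out` (capacity `out_size`, including the NUL).
 * Returns the number of bytes written (not counting the NUL), or
 * (size_t)-1 on any failure, including a too-small output buffer. */
size_t latin1_to_utf8_iconv(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes non-const pointers */
    size_t inleft;
    char *outp = out;
    size_t outleft;
    size_t rc;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;
    inleft = strlen(in);
    outleft = out_size - 1;     /* reserve one byte for the NUL */

    cd = iconv_open("UTF-8", "ISO-8859-1");  /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *outp = '\0';
    return (size_t)(outp - out);
}
```

The same wrapper would work for other source encodings by changing the second argument of `iconv_open`, which is the main reason to prefer a library over a hand-written loop.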

Cheers and hth. - Alf

answered Oct 30 '10 at 17:29

cytrinox

"The C++ standard does not provide functions to directly convert between charsets" – Cheers and hth. - Alf Jan 24 '18 at 12:34

score 2 · Answer 5 · answered Oct 31 '10 at 00:44

The Unicode folks have some tables that might help if faced with Windows 1252 instead of true ISO-8859-1. The definitive one seems to be this one which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
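
A hand-written version of such a lookup table might look like the sketch below. Only bytes 0x80-0x9F differ between CP-1252 and ISO-8859-1, so the table has 32 entries. The values are typed in by hand rather than generated from the Unicode mapping file, so they should be checked against it before use. The function name `cp1252_char_to_utf8` and the "0 means undefined" convention are assumptions.

```c
#include <assert.h>

/* Unicode code points for CP-1252 bytes 0x80-0x9F. A value of 0 marks
 * the five bytes that CP-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90,
 * 0x9D). Hand-typed: verify against the Unicode mapping table. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: encodes one CP-1252 byte as UTF-8 into `out`
 * (room for 3 bytes). Returns the number of bytes written (1 to 3),
 * or 0 if the byte is undefined in CP-1252. */
int cp1252_char_to_utf8(unsigned char c, unsigned char out[3])
{
    unsigned int cp = c;        /* outside 0x80-0x9F: same as Latin-1 */

    if (c >= 0x80 && c <= 0x9F) {
        cp = cp1252_high[c - 0x80];
        if (cp == 0)
            return 0;
    }
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Note that the output can now be up to three bytes per input byte (for example 0x80, the euro sign, becomes E2 82 AC), so the "twice the input size" buffer rule for pure ISO-8859-1 no longer applies.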

o0Evolved0o · Answer 6 · 2022-07-24T20:04:29.783

-1

The code

int isolat1ToUTF8(unsigned char* out, int *outlen,
                  const unsigned char* in, int *inlen) {
    unsigned char* outstart = out;
    const unsigned char* base = in;
    const unsigned char* processed = in;
    unsigned char* outend = out + *outlen;
    const unsigned char* inend;
    unsigned int c;
    int bits;

    inend = in + (*inlen);
    /* stop while there is still some headroom left in the output buffer */
    while ((in < inend) && (out - outstart + 5 < *outlen)) {
        c = *in++;

        /* assertion: c is a single code point in the range 0x00-0xFF */
        if (out >= outend)
            break;
        if (c < 0x80) { *out++ = c;                         bits = -6; }  /* ASCII: one byte */
        else          { *out++ = ((c >> 6) & 0x1F) | 0xC0;  bits =  0; }  /* lead byte of a two-byte sequence */

        /* emit the continuation byte, if any */
        for ( ; bits >= 0; bits -= 6) {
            if (out >= outend)
                break;
            *out++ = ((c >> bits) & 0x3F) | 0x80;
        }
        processed = (const unsigned char*) in;
    }
    *outlen = out - outstart;   /* number of bytes written */
    *inlen = processed - base;  /* number of bytes consumed */
    return(0);
}

I think this could be helpful! Sorry about my last comment, which was deleted. I can give you the link if needed; there is a full explanation in a .c file, which is where I got this from. Cheers!

edited Jul 24 '22 at 20:04

answered Jul 24 '22 at 19:39

o0Evolved0o

A link to an image of code does not meet the standards for a Stackoverflow answer. The link can go bad, and code from images can not be directly copied. – Andrew Henle Jul 24 '22 at 19:58

Hi, and welcome to Stack Overflow! Note that code here should be presented as formatted source code _text_, not image. Please read the help section of the site for more! Also the question is over 12 years old, and while it is good to write up-to-date answers to old questions, your answer seems to contain code very similar to what is already found in some answers under this Q. – hyde Jul 24 '22 at 19:58
@AndrewHenle Hey! Is it better now? – o0Evolved0o Jul 24 '22 at 20:05
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](https://stackoverflow.com/review/late-answers/32324127) – Nol4635 Jul 28 '22 at 00:26

score -2 · Answer 7 · answered Oct 30 '10 at 17:39

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm because ISO-8859-1 is a subset of Unicode. So you already have the Unicode code points. Check Wikipedia for the algorithm.
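
For reference, the general encoding algorithm can be sketched as follows. Since each ISO-8859-1 byte value is already its Unicode code point, passing the byte straight in is enough for this question. The function name `utf8_encode` and its "return 0 on invalid input" convention are assumptions for illustration. The checks for surrogates and values above U+10FFFF are not needed for Latin-1 input but are part of the UTF-8 definition.

```c
#include <assert.h>

/* Hypothetical helper: encodes Unicode code point `cp` as UTF-8 into
 * `out` (room for 4 bytes). Returns the number of bytes written (1 to
 * 4), or 0 if `cp` is a surrogate (U+D800-U+DFFF) or above U+10FFFF. */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not valid */
        return 0;
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of Unicode range */
}
```

For ISO-8859-1 input only the first two branches are ever taken, which is why the short loops in the other answers are sufficient.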

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

Cheers & hth.,

Cheers and hth. - Alf

The algorithm is not entirely trivial, especially when novice to intermediate C coders often mistakenly use `char *` where `unsigned char *` is needed. More significant nontrivialities are in the definition of UTF-8, specifically that you need to reject surrogate codepoints and out-of-range values. Thankfully those won't come up in an encoder that only needs to handle ISO-8859-1 input, but if you write such a limited encoder it's likely someone will end up misusing it for a larger input range later without adding any checks. – R.. GitHub STOP HELPING ICE Oct 31 '10 at 01:40
@MichałLeon: Unicode is not an encoding. There are a number of different encodings of Unicode, including UTF-8 and UTF-16. The first 256 code points of Unicode are the same as Latin 1 (a.k.a. ISO-8859-1). Note: emphasis doesn't make you less at odds with trivial fact. Next time, instead of shouting and downvoting, consider simply checking facts, or just ask about anything you don't understand. – Cheers and hth. - Alf Jan 23 '18 at 17:23
@Martin: The block of Unicode code points 128 through 255 is called the ["Latin-1 supplement" of Unicode](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)), because it's the same as Latin-1. Unicode is a direct extension of Latin-1. Your comments are absurd nonsense, the kind of techno-babble that can influence non-technical people and indicates trolling. I presume you're trolling. – Cheers and hth. - Alf Jan 24 '18 at 10:59
@MichałLeon: OK, sorry. I should maybe have guessed: I have for many years helped a student with extremely bad eye-sight, and she routinely fails to see what's right there. Latin-1 is specified in the OP's posting, in my answer, in all my comments, and in the other answers except one. – Cheers and hth. - Alf Jan 24 '18 at 13:50
