
I have been trying to convert an ISO-8859-1 string to UTF-8 with the code from "Convert ISO-8859-1 strings to UTF-8 in C/C++". Here is my code:

#include <iostream>
#include <string>

using namespace std;
int main(int argc,char* argv[])
{
    string fileName ="ħëlö";
    int len= fileName.length();
    char* in = new char[len+1];
    char* out = new char[2*(len+1)];
    memset(in,'\0',len+1);
    memset(out,'\0',len+1);
    memcpy(in,fileName.c_str(),2*(len+1));


    while( *in )
    {
            cout << " ::: " << in ;
            if( *in <128 )
            {
                    *out++ = *in++;
            }
            else
            {
                    *out++ = 0xc2+(*in>0xbf);
                    *out++ = (*in++&0x3f)+0x80;
            }
    }
    cout << "\n\n out ::: " << out << "\n";
    *out = '\0';
}

But the output is

::: ħëlö ::: ?ëlö ::: ëlö ::: ?lö ::: lö ::: ö ::: ?

 out :::   

The output `out` should be a UTF-8 string, but it is not. I'm seeing this on Mac OS X.

What am I doing wrong here?

Zeus
  • (1) There is a #include missing. (2) What do you expect the output to be? Please clarify. (3) What does this have to do with OS X (tag)? By the way, I confirmed the behaviour on Linux with gcc 4.7.2 – steffen Jan 08 '13 at 14:48
  • I'm not sure that std::cout will behave well with your UTF-8 encoded string. That could be the problem, rather than the conversion code. – Steve Jan 08 '13 at 14:50
  • After you fix the pointer problem in @unwind's answer, make sure your shell is set to UTF-8: http://stackoverflow.com/questions/4606570/os-x-terminal-utf-8-issues – japreiss Jan 08 '13 at 14:51
  • You should probably use `unsigned char` rather than just `char`, since you need to deal with values of 128 and above. – aschepler Jan 08 '13 at 15:01
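
The last comment can be illustrated with a short sketch. It assumes a platform where plain `char` is signed (the standard leaves this implementation-defined) and uses two's-complement conversion; both helper names are hypothetical. Under that assumption a byte such as 0xEB is read as a negative value, so the test `*in <128` in the question is true for every byte and the two-byte branch never runs.

```cpp
// Sketch: the value a raw byte takes when read through signed vs unsigned char.
// Both function names are hypothetical, chosen for illustration.

// Value seen by the question's `*in` when plain char is signed (assumed here).
int value_as_signed_char(unsigned char raw)
{
    return static_cast<signed char>(raw);
}

// Value seen after reading the same byte through unsigned char.
int value_as_unsigned_char(unsigned char raw)
{
    return raw;
}
```

For the ISO-8859-1 byte 0xEB (ë), the first function gives -21 on the usual two's-complement platforms and the second gives 235. Only the second value fails the `< 128` test and reaches the two-byte branch.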

2 Answers

You are incrementing the out pointer in the loop, causing you to lose track of where the output starts. The pointer being passed to cout is the incremented one, so it obviously doesn't point at the start of the generated output any longer.

Further, the termination of out happens after printing it, which of course is the wrong way around.

Also, the result depends on the encoding of the source file, which is fragile. To be on the safe side, express the input string as individual bytes with hex escapes (for example `\xeb` for ë) instead of typing the accented characters directly.
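
The points above can be sketched as follows. This is one possible rewrite, not the asker's code: the function name `latin1_to_utf8` is hypothetical, and using `std::string` in place of the raw `new[]` buffers is an assumption made to avoid tracking the start pointer and the terminator by hand. Bytes are read as `unsigned char` so that values of 0x80 and above compare correctly.

```cpp
#include <string>

// Sketch: convert ISO-8859-1 bytes to UTF-8.
// latin1_to_utf8 is a hypothetical helper name, not taken from the question.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(2 * in.size());  // worst case: two output bytes per input byte
    for (unsigned char c : in)   // unsigned, so bytes >= 0x80 are seen as 128..255
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);  // ASCII is copied unchanged
        }
        else
        {
            out += static_cast<char>(0xc2 + (c > 0xbf));  // lead byte: 0xC2 or 0xC3
            out += static_cast<char>((c & 0x3f) + 0x80);  // continuation byte
        }
    }
    return out;
}
```

Calling `latin1_to_utf8("h\xebl\xf6")` yields the six bytes `68 C3 AB 6C C3 B6`. Printing that string shows hëlö only if the terminal is set to UTF-8.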

unwind

ISO-8859-1 does not contain the character ħ, so your source cannot possibly be in ISO-8859-1 as the method requires. Alternatively, your source is in ISO-8859-1, and ħ is replaced with ? when you save it.
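
To make this concrete, here is a minimal sketch of the general UTF-8 encoding for code points up to U+07FF. The helper name `encode_utf8_upto_7ff` is hypothetical, and returning an empty string for larger code points is a simplification chosen for this example.

```cpp
#include <string>

// Sketch: UTF-8 encoding for code points up to U+07FF (one or two bytes).
// encode_utf8_upto_7ff is a hypothetical helper; it returns an empty
// string for code points outside this range.
std::string encode_utf8_upto_7ff(unsigned int cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                  // single byte: 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xc0 | (cp >> 6));    // lead byte: 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3f));  // continuation: 10xxxxxx
    }
    return out;
}
```

`encode_utf8_upto_7ff(0x127)` (ħ, U+0127) gives the bytes `C4 A7`. Every ISO-8859-1 byte above 0x7F maps to a code point in U+0080..U+00FF, so the Latin-1 method only ever produces the lead bytes 0xC2 and 0xC3 and can never emit ħ. Input containing ħ must come from a different source encoding, such as UTF-8 itself or ISO-8859-3.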

Esailija