I have to import a UTF-8 encoded text file into my C++Builder 5 program. Are there any components or code samples to accomplish that?
ANSI is the American National Standards Institute. So I think you mean ASCII. – Gumbo Jan 24 '09 at 12:10
Most likely he means Windows-1252 (also known as WinLatin1), which includes ASCII, but adds another 128 code points... – Christoph Jan 24 '09 at 12:45
4 Answers
Here is a more VCL-centric approach for you:
UTF8String utf8 = "...";
WideString utf16;
AnsiString latin1;
int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), utf8.Length(), NULL, 0);
utf16.SetLength(len);
::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), utf8.Length(), utf16.c_bstr(), len);
len = ::WideCharToMultiByte(1252, 0, utf16.c_bstr(), utf16.Length(), NULL, 0, NULL, NULL);
latin1.SetLength(len);
::WideCharToMultiByte(1252, 0, utf16.c_bstr(), utf16.Length(), latin1.c_str(), len, NULL, NULL);
If you upgrade to C++Builder 2009, you can simplify it to this:
UTF8String utf8 = "...";
AnsiString<1252> latin1 = utf8;

You are best off reading the other questions on SO that are tagged unicode and c++. For starters, you should probably look at this one and see whether the library in the accepted answer (UTF8-CPP) works for you.
I would, however, first think about what you're trying to achieve, as there is no way you can just import UTF-8-encoded strings into "Ansi" (whatever you mean by that; maybe something like the ISO8859_1 or WIN1252 encoding?).
As there is no-one working on weekends, I have to answer it myself :)
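#include <string.h>   // strlen

// Converts the UTF-8 text in aData to single-byte text in aValue.
// aValue must have room for at least strlen(aData) + 1 bytes.
// Output bytes map 1:1 to U+0000..U+00FF (Latin-1), which matches
// Windows-1252 everywhere except the range 0x80..0x9F.
// Code points above U+00FF do not fit in one byte and are replaced with '?'.
String Utf8ToWinLatin1(const char* aData, char* aValue)
{
    const unsigned char* src = (const unsigned char*)aData;
    const size_t len = strlen(aData);   // computed once, not on every iteration
    size_t i = 0;                       // write index into aValue
    size_t j = 0;                       // read index into aData
    while (j < len)
    {
        unsigned char c = src[j];
        unsigned long cp = 0;           // decoded code point
        size_t n = 1;                   // number of bytes in this sequence
        if (c <= 127)                   { cp = c; }
        else if (c >= 192 && c <= 223)  { cp = c - 192; n = 2; }
        else if (c >= 224 && c <= 239)  { cp = c - 224; n = 3; }
        else if (c >= 240 && c <= 247)  { cp = c - 240; n = 4; }
        else                            { j += 1; continue; }  // skip invalid lead byte
        if (j + n > len)
            break;                      // truncated sequence at end of input
        for (size_t k = 1; k < n; k++)
            cp = cp * 64 + (src[j + k] & 0x3F);   // low 6 bits of each continuation byte
        aValue[i++] = (cp <= 255) ? (char)cp : '?';
        j += n;
    }
    aValue[i] = '\0';                   // terminate before the String is built
    return aValue;
}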
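One gap in decoding straight to byte values: Windows-1252 differs from Latin-1 in the range 0x80..0x9F, so a character such as € (U+20AC, byte 0x80 in Windows-1252) needs a lookup table. The following is only a sketch of that idea in standard C++ with `std::string` rather than VCL types; the names `CodePointToWin1252` and `Utf8ToWin1252` are hypothetical, and the decoder does not reject overlong encodings.

```cpp
#include <string>

// Hypothetical helper: maps one Unicode code point to a Windows-1252 byte.
// Returns '?' for code points that Windows-1252 cannot represent.
unsigned char CodePointToWin1252(unsigned long cp)
{
    // Unicode values of Windows-1252 bytes 0x80..0x9F (0 = unassigned byte).
    static const unsigned short kHigh[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return static_cast<unsigned char>(cp);   // same value as in Latin-1
    for (int i = 0; i < 32; ++i)
        if (kHigh[i] != 0 && kHigh[i] == cp)
            return static_cast<unsigned char>(0x80 + i);
    return '?';
}

// Hypothetical helper: decodes UTF-8 and re-encodes it as Windows-1252.
// Malformed or unrepresentable input produces '?'.
std::string Utf8ToWin1252(const std::string& utf8)
{
    std::string out;
    std::string::size_type i = 0, n = utf8.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = c;
        int extra = 0;                           // continuation bytes expected
        if (c >= 0xF8)      { out += '?'; ++i; continue; }  // never valid in UTF-8
        else if (c >= 0xF0) { cp = c & 0x07; extra = 3; }
        else if (c >= 0xE0) { cp = c & 0x0F; extra = 2; }
        else if (c >= 0xC0) { cp = c & 0x1F; extra = 1; }
        else if (c >= 0x80) { out += '?'; ++i; continue; }  // stray continuation byte
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= n || (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
                ok = false;                      // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
            ++i;
        }
        out += ok ? static_cast<char>(CodePointToWin1252(cp)) : '?';
    }
    return out;
}
```

Calling `Utf8ToWin1252` on the UTF-8 bytes for "ÕÖÄÜ" should give the four single bytes 0xD5 0xD6 0xC4 0xDC, while Cyrillic or Chinese input collapses to question marks. On Windows, the `MultiByteToWideChar` / `WideCharToMultiByte` pair from the other answer is likely the simpler route.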

You should know that ASCII only has 128 characters compared to the 1,114,112 Unicode characters that can be encoded with UTF-8. So you will lose all characters that are not in the ASCII charset. – Gumbo Jan 24 '09 at 12:12
Your function would be better named something like `Utf8ToWinLatin1()`; `ConvertAnsi` doesn't specify what gets converted to what. Also, 'ANSI' isn't the name of any encoding... – Christoph Jan 24 '09 at 12:45
I don't care about 1,000,000 characters - I only want my native ones back (ÕÖÄÜ). I called it Ansi, because that's what it is called in Notepad :) when you select SaveAs. – Riho Jan 24 '09 at 20:21
Your question doesn't say specifically which character set you want to convert to. If you only want the basic 7-bit ASCII charset, discarding every character with a higher value than 127 will work.
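The 7-bit case can be sketched roughly as follows. This is an illustration in standard C++ rather than VCL code, and `StripToAscii` is a hypothetical name, not something from the question. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so dropping bytes of 0x80 and above removes whole non-ASCII characters.

```cpp
#include <string>

// Hypothetical helper: keeps only the 7-bit ASCII bytes of a UTF-8 string.
// Lead and continuation bytes of multi-byte sequences are all >= 0x80,
// so each non-ASCII character disappears completely.
std::string StripToAscii(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80)
            out += static_cast<char>(c);
    }
    return out;
}
```

Note that accented letters are removed, not replaced by their base letters, so this only suits text where the non-ASCII characters really are disposable.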
If you want to convert to an 8-bit character set, such as latin1, you'll have to do it the hard way.

He didn't ask about conversion to Latin1 though, just to "ANSI", which can mean a lot of things. Of course, if he wants to convert to some specific 8-bit character set (such as latin1), then you're right, this won't work. – jalf Jan 24 '09 at 13:23
@jalf: 'ANSI' is a common, incorrect label for Windows-1252 (aka WinLatin1); check wikipedia for details... – Christoph Jan 24 '09 at 13:31
"The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community" – Christoph Jan 24 '09 at 13:32
Yep, not saying that isn't what he meant, just that if it isn't, and if he only wants the 128 ASCII chars, this is a much simpler solution than his own – jalf Jan 24 '09 at 13:47
In any case, there will be data loss when there are characters that are not elements of the smaller charset. – Gumbo Jan 24 '09 at 17:08
Yes, I don't want any cyrillic or Chinese characters, I just need the common Win-1252 symbols out (like öõäü). And it works. – Riho Jan 24 '09 at 20:24