
I have some strings read from the database, stored in a char* and in UTF-8 format (you know, "á" is encoded as 0xC3 0xA1). But, in order to write them to a file, I first need to convert them to ANSI (I can't make the file in UTF-8 format... it's only read as ANSI), so that my "á" doesn't become "Ã¡". Yes, I know some data will be lost (Chinese characters, and in general anything not in the ANSI code page), but that's exactly what I need.

But the thing is, I need the code to compile on various platforms, so it has to be standard C++ (i.e. no Winapi, only stdlib, STL, CRT or any custom library with available source).

Does anyone have any suggestions?

  • Can't you just write such a function? [Here](http://www.alanwood.net/demos/ansi.html) is a translation table. – Carl Norum Jul 10 '13 at 05:04
  • @curiousguy: A Microsoft misunderstanding. The question is a bit troublesome, though. "compile in various platforms" should not be necessary. The ANSI stuff should be quarantined on the originating Microsoft system. You can't even reliably send it from one Windows machine to another. – MSalters Jul 10 '13 at 13:36
  • Great comment, indeed. To anyone reading, please do what MSalters says. Even more, please make all your programs output text in the UTF-8 encoding. But unfortunately, that is not my case. I need a program that will not run on Windows to generate a file that will be read on Windows (it's not even a plain text file, and the library that generates it doesn't allow me to change the file encoding). Thus the need for the code to compile on various platforms, and to convert from UTF-8 to ANSI. – José Ernesto Lara Rodríguez Jul 16 '13 at 23:17
  • Perhaps it is not ANSI encoding but ASCII encoding – Basile Starynkevitch Jan 02 '19 at 11:52
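The translation-table idea from the first comment can be sketched in standard C++ without any locale support. This is only a minimal sketch, assuming "ANSI" means Windows-1252; the names `kCp1252High`, `decode_utf8` and `utf8_to_cp1252` are hypothetical helpers invented for this illustration, not part of any library. Anything outside the code page, and any malformed UTF-8, is replaced by a placeholder character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Code points for Windows-1252 bytes 0x80..0x9F; 0 marks an unassigned byte.
// (Hypothetical helper table for this sketch.)
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178};

static const std::uint32_t kInvalid = 0xFFFFFFFFu;

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns kInvalid for malformed or overlong input.
static std::uint32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;  // plain ASCII
    int extra = 0;
    std::uint32_t cp = 0;
    std::uint32_t min_cp = 0;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
    else return kInvalid;  // stray continuation byte or invalid lead byte
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return kInvalid;  // truncated sequence
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return kInvalid;  // leave b for the next call
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp < min_cp ? kInvalid : cp;
}

// Convert UTF-8 to Windows-1252, substituting `replacement` for anything
// the code page cannot represent.
std::string utf8_to_cp1252(const std::string& utf8, char replacement = '?') {
    std::string out;
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const std::uint32_t cp = decode_utf8(utf8, i);
        char mapped = replacement;
        if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
            // ASCII and U+00A0..U+00FF share their values with Windows-1252.
            mapped = static_cast<char>(static_cast<unsigned char>(cp));
        } else {
            // Search the 0x80..0x9F block (euro sign, curly quotes, etc.).
            for (int k = 0; k < 32; ++k) {
                if (kCp1252High[k] != 0 && kCp1252High[k] == cp) {
                    mapped = static_cast<char>(static_cast<unsigned char>(0x80 + k));
                    break;
                }
            }
        }
        out.push_back(mapped);
    }
    return out;
}
```

With this sketch, `utf8_to_cp1252("\xC3\xA1")` yields the single byte 0xE1, and a Chinese character collapses to one `?`. The linear table search keeps the code short; a real implementation might prefer a switch or a sorted lookup.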

4 Answers


A few days ago, somebody answered that if I had a C++11 compiler, I could try this:

#include <codecvt>
#include <locale>
#include <string>
#include <vector>

using namespace std;

string utf8_to_string(const char *utf8str, const locale& loc)
{
    // UTF-8 to wstring
    wstring_convert<codecvt_utf8<wchar_t>> wconv;
    wstring wstr = wconv.from_bytes(utf8str);
    // wstring to string; characters the locale cannot represent become '?'
    vector<char> buf(wstr.size());
    use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
    return string(buf.data(), buf.size());
}

int main(int argc, char* argv[])
{
    string ansi;
    // "á" in UTF-8; a string literal avoids narrowing 0xc3 into a signed char
    const char utf8txt[] = "\xc3\xa1";

    // I guess you want to use Windows-1252 encoding...
    // (".1252" is a Windows-style locale name; other platforms may need a different name)
    ansi = utf8_to_string(utf8txt, locale(".1252"));
    // Now do something with the string
    return 0;
}

I don't know what happened to that response; apparently someone deleted it. But it turns out to be the perfect solution. To whoever posted it, thanks a lot. You deserve the accepted answer and the upvote!
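One caveat with this approach is that locale names are not portable, and constructing a `std::locale` from a name the platform does not know throws `std::runtime_error`. A minimal sketch of guarding against that follows; `find_locale` is a hypothetical helper, and the candidate names mentioned afterwards are assumptions that depend on what is installed on each system.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Try each candidate name in order and store the first locale that can be
// constructed in `out`. Returns false if none of the names is available.
bool find_locale(const std::vector<std::string>& names, std::locale& out) {
    for (const std::string& name : names) {
        try {
            out = std::locale(name);
            return true;
        } catch (const std::runtime_error&) {
            // Unknown locale name on this platform: try the next candidate.
        }
    }
    return false;
}
```

A caller could pass something like `{".1252", "en_US.CP1252"}` (both names are assumptions, not guaranteed to exist anywhere) and fall back to a hand-written table when the function returns false.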

  • I'm testing this code for Chinese utf8 string to ANSI(GBK) string (Windows 7 Chinese edition, and Visual studio 2015 C++), but it failed, instead, this answer https://stackoverflow.com/a/35272822/154911 works. – ollydbg23 Apr 09 '18 at 05:17
  • @ollydbg23 The answer you are pointing to uses Windows APIs, I specifically state "No Winapi" in the question ;). Also, my question states that I'm ok with losing data that's not in the Windows-1252 code page (and specifically make the example of Chinese characters), so I believe this wasn't the answer you were looking for. But thanks for the pointer! It could help people with your question to get an answer that suits their needs. – José Ernesto Lara Rodríguez Apr 10 '18 at 23:12
  • Hi, sorry that I didn't see the "no winapi" words. About your code in this answer, I am not sure which chars are not in the Windows-1252 code page. I tested with quite simple Chinese chars, and it just does not work. – ollydbg23 Apr 11 '18 at 11:01

If you mean ASCII, just discard any byte that has bit 7 set; this will remove all multibyte sequences. Note that you could create more advanced algorithms, like removing the accent from the "á", but that would require much more work.
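A minimal sketch of that byte filter, assuming the input is a `std::string` holding UTF-8 (`strip_non_ascii` is a hypothetical name chosen for this example):

```cpp
#include <string>

// Keep only 7-bit ASCII bytes. Every byte of a UTF-8 multibyte sequence has
// bit 7 set, so whole non-ASCII characters disappear.
std::string strip_non_ascii(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char byte : utf8) {
        if ((byte & 0x80) == 0) {
            out.push_back(static_cast<char>(byte));
        }
    }
    return out;
}
```

Note that a precomposed "é" vanishes entirely, while a decomposed "a" followed by a combining accent keeps the base letter.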

Ulrich Eckhardt
  • Removing the accent may be free, depending on the kind of UTF-8 you have. If it's decomposed, the accent is represented as 2 bytes following the "a", both of which are >0x80. – MSalters Jul 10 '13 at 13:38
  • No, not exactly. See, I want to discard all characters that are not on the ANSI codepage (or replace them with anything, like "?" or "_"), but the characters that are on the ansi codepage (a.k.a. codepage 1252) I need them converted, not lost. – José Ernesto Lara Rodríguez Jul 16 '13 at 23:14
  • "No, not exactly" what? I'm not sure what you are objecting to here. – Ulrich Eckhardt Jul 17 '13 at 05:24
  • @UlrichEckhardt Your solution discards all multibyte sequences. My objection is that I want to keep some of them: the ones that have an equivalent in codepage 1252. Codepage 1252 includes the ASCII characters and some extended ones above 127, like accented vowels and "ñ". Sorry if I didn't make that clear in my previous comment (and the question), and thanks for answering! – José Ernesto Lara Rodríguez Apr 10 '18 at 23:18

This should work:

#include <codecvt>
#include <locale>
#include <string>

using namespace std::string_literals;

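// Note: the standard only requires std::ctype<char> and std::ctype<wchar_t>;
// std::ctype<char32_t> may not be available in the locale on every implementation.
std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {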
  using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
  std::u32string wstr(str.size(), U'\0');
  std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
  return wcvt{}.to_bytes(wstr.data(),wstr.data() + wstr.size());
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
  auto wstr = wcvt{}.from_bytes(str);
  std::string result(wstr.size(), '0');
  std::use_facet<std::ctype<char32_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', &result[0]);
  return result;
}

int main() {
  auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
  auto s1 = from_utf8(s0);
  auto s2 = to_utf8(s1);

  return 0;
}

For VC++:

#include <codecvt>
#include <cstdint>
#include <locale>
#include <string>

using namespace std::string_literals;

std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
  std::u32string wstr(str.size(), U'\0');
  std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
  return wcvt{}.to_bytes(
    reinterpret_cast<const int32_t*>(wstr.data()),
    reinterpret_cast<const int32_t*>(wstr.data() + wstr.size())
  );
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
  using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
  auto wstr = wcvt{}.from_bytes(str);
  std::string result(wstr.size(), '0');
  std::use_facet<std::ctype<char32_t>>(loc).narrow(
    reinterpret_cast<const char32_t*>(wstr.data()),
    reinterpret_cast<const char32_t*>(wstr.data() + wstr.size()),
    '?', &result[0]);
  return result;
}

int main() {
  auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
  auto s1 = from_utf8(s0);
  auto s2 = to_utf8(s1);

  return 0;
}
cdycdr
#include <stdio.h>
#include <string>
#include <codecvt>
#include <locale>
#include <vector>

using namespace std;
std::string utf8_to_string(const char *utf8str, const locale& loc){
    // UTF-8 to wstring
    wstring_convert<codecvt_utf8<wchar_t>> wconv;
    wstring wstr = wconv.from_bytes(utf8str);
    // wstring to string
    vector<char> buf(wstr.size());
    use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
    return string(buf.data(), buf.size());
}

int main(int argc, char* argv[]){
    std::string ansi;
    // "á" in UTF-8; a string literal avoids narrowing 0xc3 into a signed char
    const char utf8txt[] = "\xc3\xa1";

    // I guess you want to use Windows-1252 encoding...
    ansi = utf8_to_string(utf8txt, locale(".1252"));
    // Now do something with the string
    return 0;
}
Mehdi Mohammadpour