
I have a proprietary file (database) format which I am currently trying to migrate to a SQL database. To that end I am converting the files to an SQL dump, which is already working fine. The only problem left is their unusual way of handling characters outside the ASCII decimal range of 32 to 126. They have a collection of all those characters stored as Unicode code points (in hex, e.g. 20AC = €), indexed by their own internal index.

My plan is to create a table in which the internal index, the Unicode code point (in hex) and the character representation (UTF-8) are stored. This table can then be used for future updates.

Now to the problem: how do I write the UTF-8 character representation of a Unicode hex value to a file? My current code looks like this:

this->outFile.open(fileName + ".sql", std::ofstream::app);
std::string protyp;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, protyp); // Get the PROTYP Identifier (e.g. \321)
protyp = "\\" + protyp;

std::string unicodeHex;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, unicodeHex); // Get the Unicode HEX Identifier (e.g. 002C)

std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
const std::wstring wide_string = this->s2ws("\\u" + unicodeHex);
const std::string utf8_rep = converter.to_bytes(wide_string);

std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + utf8_rep + "')";

this->outFile << valueString << std::endl;

this->outFile.close();

But this just prints out something like this:

('\321', '002C', '\u002C'),

While the desired output would be:

('\321', '002C', ','),

What am I doing wrong? I have to admit that I am not that confident when it comes to character encoding. I am working on Windows 7 64-bit, if it makes any difference. Thanks in advance.

puelo
  • The conversion from `\u002C` to a wide-character value occurs at compile time, not run time. You need to forget about the `\u` and do a string to integer conversion. – Mark Ransom Apr 06 '15 at 15:27
  • It works! Thanks a lot. You can add your comment as an answer if you want. I will accept it as soon as I can. I will also add some code as a second answer to present the solution I've come up with. – puelo Apr 06 '15 at 15:36
  • My goal was to give you enough information to produce the answer yourself, and it seems I succeeded. Your sincere thanks is all the reward I need. – Mark Ransom Apr 06 '15 at 15:39

1 Answer


As @Mark Ransom pointed out in the comments, my best bet was to convert the hex string to an integer and use that. This is what I did:

unsigned int decimalHex = std::stoul(unicodeHex, nullptr, 16);

std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + this->UnicodeToUTF8(decimalHex) + "')";

The UnicodeToUTF8 function was taken from here: Unsigned integer as UTF-8 value

std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));            // 1 byte: plain ASCII
    else if (codepoint <= 0x7ff)
    {
        // 2-byte sequence: 110xxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        // 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
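
For example, here is a minimal, self-contained sketch of how the pieces fit together (the main function and the sample value are only illustrative, not from the original program): converting the hex string "20AC" through this path yields the three UTF-8 bytes of the Euro sign.

#include <iostream>
#include <string>

// Assumes the UnicodeToUTF8 function shown above is in scope.
int main()
{
    std::string unicodeHex = "20AC";                               // U+20AC = €
    unsigned int codepoint = std::stoul(unicodeHex, nullptr, 16);  // hex string -> integer
    std::string utf8_rep = UnicodeToUTF8(codepoint);               // "\xE2\x82\xAC"
    std::cout << "('" << unicodeHex << "', '" << utf8_rep << "')" << std::endl;
    // Prints ('20AC', '€') when the output is interpreted as UTF-8.
    return 0;
}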
puelo