Intro

I have some input that I need to convert to the correct Chinese characters, but I think I'm stuck at the final number-to-string conversion. Using this online hex-to-text converter, I have checked that e6b9af corresponds to the text 湯.

MWE

Here is a minimal example that I made to illustrate the problem. The input is "%e6%b9%af" (obtained from a URL somewhere else).

#include <iostream>
#include <string>

std::string attempt(std::string path)
{
  std::size_t i = path.find("%");
  while (i != std::string::npos)
  {
    std::string sub = path.substr(i, 9);
    sub.erase(i + 6, 1);
    sub.erase(i + 3, 1);
    sub.erase(i, 1);
    std::size_t s = std::stoul(sub, nullptr, 16);
    path.replace(i, 9, std::to_string(s));
    i = path.find("%");
  }
  return path;
}

int main()
{
  std::string input = "%E6%B9%AF";
  std::string goal = "湯";

  // convert input to goal
  input = attempt(input);
  
  std::cout << goal << " and " << input << (input == goal ? " are the same" : " are not the same") << std::endl;

  return 0;
}

Output

湯 and 15120815 are not the same

Expected output

湯 and 湯 are the same

Additional question

Are all characters in foreign languages represented in 3 bytes or is that just for Chinese? Since my attempt assumes blocks of 3 bytes, is that a good assumption?

LinG
  • The bytes `E6 B9 AF` are the **UTF-8 encoding** of the character you posted here. A more correct implementation would undo the URL-encoding first, and then UTF-8 decode as necessary. If you are just going to output it to a processor that expects UTF-8, you only need to URL-decode. As for your last question, see ["The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets"](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Botje Oct 17 '19 at 09:14
  • You're using `to_string` to convert a number into the string representation in base 10. What you _actually_ want is to convert that number into a single character. Since you've converted it as a 2-digit hex value, it's guaranteed to be in the correct range, so just cast it to `char` and stick it in the string. – paddy Oct 17 '19 at 09:42
  • @Botje thank you for the suggestion and the good read :) – LinG Oct 17 '19 at 10:43
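
The two comments above can be sketched as follows: parse each pair of hex digits after a `%` and append it as one raw byte instead of calling `to_string`. This is only a minimal sketch; `percent_decode` is a hypothetical name, and leaving malformed escapes (such as a trailing `%`) untouched is an assumption rather than something the comments prescribe.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: undoes %XX escapes and returns the raw bytes.
// Malformed escapes are copied through unchanged (an assumption of this sketch).
std::string percent_decode(const std::string& in)
{
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    // A valid escape is '%' followed by exactly two hex digits.
    if (in[i] == '%' && i + 2 < in.size() &&
        std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
        std::isxdigit(static_cast<unsigned char>(in[i + 2])))
    {
      // Two hex digits always fit in one byte (0x00..0xFF),
      // so the value is appended as a single char, not as decimal text.
      const unsigned long byte = std::stoul(in.substr(i + 1, 2), nullptr, 16);
      out.push_back(static_cast<char>(byte));
      i += 2; // skip the two digits just consumed
    }
    else
    {
      out.push_back(in[i]);
    }
  }
  return out;
}
```

The result is a sequence of bytes. If the terminal or the consumer expects UTF-8, the three bytes `E6 B9 AF` are displayed as a single character without any further decoding.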

1 Answer

Based on your suggestions, and adapting an example from this other post, this is what I came up with.

#include <cstdio>   // sscanf
#include <iostream>
#include <string>
#include <sstream>

std::string decode_url(const std::string& path)
{
  std::stringstream decoded;
  for (std::size_t i = 0; i < path.size(); i++)
  {
    if (path[i] != '%')
    {
      if (path[i] == '+')
        decoded << ' ';
      else
        decoded << path[i];
    }
    else
    {
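      // Parse the two hex digits after '%' into one byte value.
      // j starts at 0 so a truncated escape does not leave it uninitialized.
      unsigned int j = 0;
      sscanf(path.substr(i + 1, 2).c_str(), "%x", &j);
      decoded << static_cast<char>(j);
      i += 2;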
    }
  }
  return decoded.str();
}

int main()
{
  std::string input = "%E6%B9%AF";
  std::string goal = "湯";

  // convert input to goal
  input = decode_url(input);

  std::cout << goal << " and " << input << (input == goal ? " are the same" : " are not the same") << std::endl;

  return 0;
}

Output

湯 and 湯 are the same
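
On the additional question: no, 3 bytes is not a safe assumption. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. ASCII takes 1 byte, many accented Latin letters take 2, most Chinese characters (including 湯) take 3, and emoji take 4. Decoding one `%XX` escape at a time, as above, avoids the assumption entirely. As a rough sketch, the length of a sequence can be read from its first byte; the helper names below are hypothetical, and the check is deliberately simplified (it does not reject every invalid lead byte).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` is a continuation byte or not a lead byte.
// Simplified: invalid lead bytes such as 0xC0 or 0xF5 are not rejected.
std::size_t utf8_sequence_length(unsigned char lead)
{
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // 10xxxxxx or out of range
}

// Hypothetical helper: counts characters (code points) in a UTF-8 string
// by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s)
{
  std::size_t count = 0;
  for (unsigned char c : s)
  {
    if ((c & 0xC0) != 0x80)
      ++count;
  }
  return count;
}
```

For the decoded string from this answer, the lead byte `0xE6` gives a length of 3, and the whole string counts as one character even though `size()` reports 3 bytes.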

LinG