My high-level goal is to convert any string (which may include non-ASCII characters) into a vector of integers by converting each character to an integer.

I already have a Python snippet for this purpose:

bytes = list(text.encode())

Now I want a C++ equivalent. I tried something like this:

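#include <cstring>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
  string inputText = "testΔ";  // assumes the source file is saved as UTF-8
  char const* bytes = inputText.c_str();
  long bytesLen = strlen(bytes);
  // Each (possibly signed) char is widened to long here.
  auto vec = std::vector<long>(bytes, bytes + bytesLen);
  for (auto number : vec) {
      cout << number << endl;
  }
  return 0;
}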

For an input string like "testΔ", the Python code outputs [116, 101, 115, 116, 206, 148].

However C++ code outputs [116, 101, 115, 116, -50, -108].

How should I change the C++ code to make them consistent?

Karl Knechtel
  • "I know how to do this in another language" is not a reason to use that language's tag. – Karl Knechtel Nov 10 '20 at 19:58
  • If you are using Unicode characters you cannot use `char*`; you'll likely want to use wide characters and Unicode literals: https://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11. For example, Python's `str.encode` uses UTF-8 by default, so here's an explanation of UTF-8 support in C++: https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c – Cory Kramer Nov 10 '20 at 19:59
  • If you're sticking in ASCII space, an unsigned datatype in the vector should help a lot. – user4581301 Nov 10 '20 at 20:01
  • @user4581301 Δ is not in ASCII space. – eerorika Nov 10 '20 at 20:05
  • True enough. I was watching TV the other night and apparently the aliens built the pyramids. Maybe they are responsible for triangles, too. – user4581301 Nov 10 '20 at 20:54
  • std::vector – Martin York Nov 10 '20 at 21:17

3 Answers

However C++ code outputs [116, 101, 115, 116, -50, -108].

In C++, char is a distinct type from both signed char and unsigned char, and whether it is signed is implementation-defined. On your implementation it is signed, so bytes at or above 128 come out as negative numbers.

You thus explicitly want an unsigned char const*, but the .c_str method gives you a char const*, so you need to cast. You will need reinterpret_cast or a C-style cast; static_cast will not work between these pointer types.
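
A minimal sketch of that cast, assuming the string already holds UTF-8 bytes (the encoding Python's str.encode uses by default). The helper name to_byte_values is hypothetical and not from the question; it uses size() rather than strlen, which is a small departure from the original code.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the question): returns the bytes of `text`
// as integers in the range 0..255, like Python's list(text.encode()).
std::vector<long> to_byte_values(const std::string& text) {
    // c_str() yields char const*; view the same bytes as unsigned char.
    // static_cast cannot convert between these two pointer types.
    auto const* bytes = reinterpret_cast<unsigned char const*>(text.c_str());
    // size() is used instead of strlen so embedded '\0' bytes are kept.
    return std::vector<long>(bytes, bytes + text.size());
}
```

Called as to_byte_values(inputText), this should give non-negative values for every byte, since each element is read as an unsigned char before being widened to long.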

Karl Knechtel

You can iterate over std::string contents just fine, no need to convert it to std::vector. Try this:

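#include <iostream>
#include <string>

int main()
{
    std::string str = "abc";
    for (auto c : str)
    {
        // Go through unsigned char first so bytes >= 0x80 print as 128..255.
        std::cout << static_cast<unsigned int>(static_cast<unsigned char>(c)) << std::endl;
    }
}

The cast is needed because the standard operator<< prints a char as a character, not as a number. Otherwise, you can work with it like any other integral type. We convert to unsigned char first and then to unsigned int so that the output is non-negative, since the signedness of char is implementation-defined. Casting a negative char directly to unsigned int would wrap around instead (for example, -50 becomes 4294967246 with a 32-bit unsigned int, not 206).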

jhkouy78reu9wx

How should I change the C++ code to make them consistent?

The difference appears to be that Python reports each byte as an integer from 0 to 255, while char is signed in your C++ implementation. One solution: reinterpret the string as an array of unsigned char.
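
To illustrate where the negative numbers come from, here is a small sketch that converts each char by value instead of reinterpreting a pointer; it should give the same numbers. The names char_is_signed, byte_value and byte_values are hypothetical, and the "modulo 256" comment assumes 8-bit bytes.

```cpp
#include <limits>
#include <string>
#include <vector>

// True when plain char is signed on this implementation (implementation-defined).
constexpr bool char_is_signed = std::numeric_limits<char>::is_signed;

// Hypothetical helper: map one char to the 0..255 value Python would report.
// Conversion to unsigned char is modulo 256 (assuming 8-bit bytes),
// so -50 becomes 206 and -108 becomes 148.
int byte_value(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: apply byte_value to every char of `text`.
std::vector<int> byte_values(const std::string& text) {
    std::vector<int> out;
    out.reserve(text.size());
    for (char c : text) {
        out.push_back(byte_value(c));
    }
    return out;
}
```

Where char_is_signed is false, plain char already holds 0 to 255 and the conversion changes nothing, so the helper is harmless to keep in portable code.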

eerorika