
I have a string encoded in Windows-1256 that is displayed as ÓíÞÑÕäí áßí ¿. The string should display as Arabic text if the operating system is configured to use that encoding.

Here is the HEX representation of the string:

(image of the hex dump)

My intention is to convert the text to UTF-8 manually (using lookup tables to see which bytes need to be altered and which should be left as-is).
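To make the lookup-table idea concrete, here is a hedged sketch (the function name `cp1256ToUtf8` is hypothetical, and the table is only a partial excerpt covering the letters in this particular string; a complete Windows-1256 table has entries for all 128 bytes in 0x80..0xFF):

```cpp
#include <string>
#include <unordered_map>

// Convert a Windows-1256 byte string to UTF-8.
// NOTE: partial, illustrative table -- only the letters appearing in
// this question's string are mapped here.
std::string cp1256ToUtf8(const std::string& in) {
    static const std::unordered_map<unsigned char, char32_t> table = {
        {0xD3, 0x0633},  // س SEEN
        {0xED, 0x064A},  // ي YEH
        {0xDE, 0x0642},  // ق QAF
        {0xD1, 0x0631},  // ر REH
        {0xD5, 0x0635},  // ص SAD
        {0xE4, 0x0646},  // ن NOON
        {0xE1, 0x0644},  // ل LAM
        {0xDF, 0x0643},  // ك KAF
        {0xBF, 0x061F},  // ؟ ARABIC QUESTION MARK
    };

    std::string out;
    for (unsigned char b : in) {
        char32_t cp = b;                      // ASCII bytes map to themselves
        if (b >= 0x80) {
            auto it = table.find(b);
            if (it == table.end()) continue;  // unmapped byte: skipped in this sketch
            cp = it->second;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);     // 1-byte UTF-8 sequence
        } else {                              // all code points here fit in 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the Windows-1256 bytes `D3 ED` (`سي`) come out as the UTF-8 bytes `D8 B3 D9 8A`.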

I need to iterate through all the bytes in the string to inspect each byte's binary value. The string is printed to the output stream as ÓíÞÑÕäí áßí ¿, which is 13 visible characters. But when I iterate through the bytes, the loop runs for 24 iterations, almost double that. Maybe the string is being treated as UTF-8 or UTF-16.

How can I access the numerical value of each byte in the string?

#include <iostream>
#include <bitset>
#include <cstdint>   // uint8_t
#include <string>

using std::string;
using std::cout;
using std::endl;


int main() {

    string myString = "ÓíÞÑÕäí áßí ¿";  
    // text is written in Windows-1256 encoding

    cout << "string is : " << myString << endl;  
    // outputs: string is : ÓíÞÑÕäí áßí ¿

    cout << "length : " << myString.size() << endl;  
   // outputs : length : 24
    
    for (std::size_t i = 0; i < myString.size(); ++i)
    {
        uint8_t         b1 = (uint8_t)myString.c_str()[i];
        unsigned char   b2 = (unsigned char) myString.c_str()[i];
        unsigned int    b3 = (unsigned int) myString.c_str()[i];
        int             b4 = (int) myString.c_str()[i];

        cout    << i << " - " 
                << std::bitset<8>(myString.c_str()[i]) 
                << " : " << b1   // prints �
                << " : " << b2   // prints �
                << " : " << b3   // prints very large numbers, except for spaces (32)
                << " : " << b4   // negative values, except for the space (32)
                << endl;
    }
    return 0;
}

output

string is : ÓíÞÑÕäí áßí ¿
length : 24

 0 - 11000011 : � : � : 4294967235 : -61
 1 - 10010011 : � : � : 4294967187 : -109
 2 - 11000011 : � : � : 4294967235 : -61
 3 - 10101101 : � : � : 4294967213 : -83
 4 - 11000011 : � : � : 4294967235 : -61
 5 - 10011110 : � : � : 4294967198 : -98
 6 - 11000011 : � : � : 4294967235 : -61
 7 - 10010001 : � : � : 4294967185 : -111
 8 - 11000011 : � : � : 4294967235 : -61
 9 - 10010101 : � : � : 4294967189 : -107
10 - 11000011 : � : � : 4294967235 : -61
11 - 10100100 : � : � : 4294967204 : -92
12 - 11000011 : � : � : 4294967235 : -61
13 - 10101101 : � : � : 4294967213 : -83
14 - 00100000 :   :   : 32 : 32
15 - 11000011 : � : � : 4294967235 : -61
16 - 10100001 : � : � : 4294967201 : -95
17 - 11000011 : � : � : 4294967235 : -61
18 - 10011111 : � : � : 4294967199 : -97
19 - 11000011 : � : � : 4294967235 : -61
20 - 10101101 : � : � : 4294967213 : -83
21 - 00100000 :   :   : 32 : 32
22 - 11000010 : � : � : 4294967234 : -62
23 - 10111111 : � : � : 4294967231 : -65
Ahmad
  • `uint8_t` is probably a typedef for `unsigned char`, and casting a `char` which is signed 8 bit integer to unsigned integer can indeed produce very large numbers. Try to print the values as signed ints. – pptaszni Dec 09 '22 at 21:37
  • @pptaszni I already tried signed integer. It printed the wrong values – Ahmad Dec 09 '22 at 21:49
  • Post text, not images of text so we can copy the byte data if needed, and the byte data is actually UTF-8 for `'ÓíÞÑÕäí áßí ¿'`. – Mark Tolonen Dec 09 '22 at 22:06
  • You face a [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (*example in Python for its universal intelligibility*): `'سيقرصني لكي ؟'.encode('cp1256').decode('cp1252')` returns mojibake `'ÓíÞÑÕäí áßí ¿'` and vice versa: `'ÓíÞÑÕäí áßí ¿'.encode('cp1252').decode('cp1256')` -> `سيقرصني لكي ؟`. – JosefZ Dec 09 '22 at 22:07
  • 1
    Your source code is saved as UTF-8, so your string contains UTF-8-encoded `'ÓíÞÑÕäí áßí ¿'`. That's why the length is 24. You'd have to save your source in Windows-1256 and convince your compiler that is the source encoding for your string to be encoded in Windows-1256. – Mark Tolonen Dec 09 '22 at 22:14
  • @JosefZ how can I convert the text from windows1256 to utf8 in c++? – Ahmad Dec 09 '22 at 22:25
  • Your source code is saved as UTF-8, so why don't you use `string myString = "سيقرصني لكي ؟"`? To display it properly, use appropriate code page in `chcp` _before_ you launch your exe… – JosefZ Dec 11 '22 at 16:57
  • @JosefZ never mind. I was able to solve this problem – Ahmad Dec 12 '22 at 05:21
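The comments above describe the mojibake: the compiler stored the literal as UTF-8, so every original byte ≥ 0x80 became a two-byte sequence with a 0xC2 or 0xC3 lead byte. A hedged sketch (the helper name `utf8ToSingleBytes` is hypothetical) that folds those pairs back into the single Windows-1256 bytes they came from:

```cpp
#include <cstddef>
#include <string>

// Undo the accidental UTF-8 encoding: decode each 0xC2/0xC3-led two-byte
// sequence back to a single code point in 0x80..0xFF, recovering the
// original single-byte (Windows-1256) string.
std::string utf8ToSingleBytes(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size()) {
            unsigned char cont = static_cast<unsigned char>(utf8[i + 1]);
            out += static_cast<char>(((c & 0x1F) << 6) | (cont & 0x3F));
            ++i;  // skip the continuation byte we just consumed
        } else {
            out += static_cast<char>(c);  // ASCII passes through unchanged
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes `C3 93 C3 AD` (`Óí`) gives back `D3 ED`, which Windows-1256 reads as `سي`.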

1 Answer


I was finally able to iterate through the string byte by byte using the following code (which I copied from another answer, but I couldn't find the link to it):

#include <algorithm>   // std::transform
#include <cstddef>     // std::byte (requires C++17)
#include <iterator>    // std::back_inserter, std::size
#include <string>
#include <vector>

// this function receives any std::string and returns a vector<byte>
// containing the numerical value of each byte in the string
std::vector<std::byte> getBytes(std::string const &s) {
    std::vector<std::byte> bytes;
    bytes.reserve(std::size(s));

    std::transform(std::begin(s),
                   std::end(s),
                   std::back_inserter(bytes),
                   [](char const &c){ return std::byte(c); });

    return bytes;
}
Ahmad