My application is developed in C++11 and uses Qt5. In this application, I need to store UTF-8 text as a Windows-1250 encoded file. I tried the following two approaches, and both work except for the Romanian characters 'ș' and 'ț' :(

    // 1. QTextStream with the codec set on the stream
    auto data = QStringList() << ... <some texts here>;
    QTextStream outStream(&destFile);
    outStream.setCodec(QTextCodec::codecForName("Windows-1250"));
    foreach (auto qstr, data)
    {
        outStream << qstr << EOL_CODE;
    }

    // 2. Encoding each string explicitly with QTextCodec
    auto data = QStringList() << ... <some texts here>;
    auto *codec = QTextCodec::codecForName("Windows-1250");
    foreach (auto qstr, data)
    {
        const QByteArray encodedString = codec->fromUnicode(qstr);
        destFile.write(encodedString);
    }

In the case of the 'ț' character (UTF-8 bytes 0xC8 0x9B), the character is encoded and stored as 0x3F ('?') instead of the expected value 0xFE.

So I am looking for any help or experience / examples regarding text recoding.

Best regards,

avf
  • Did you check that `codecForName` is not returning 0? By the way, you can also [pass a string to `setCodec` directly](https://doc.qt.io/qt-5/qtextstream.html#setCodec-1). – Thomas Jun 08 '20 at 13:36
  • Hi @Thomas. No, codecForName does not return nullptr and other Romanian characters are correctly converted... Only 'ș' and 'ț' exhibit this strange behavior. I tried to pass a string to setCodec method and the behavior is the same – avf Jun 08 '20 at 13:47
  • 0xFE in Windows-1250 is "t with cedilla", U+0163. You have U+021B, "t with comma below". It's not representable in codepage 1250, so the conversion produces the question mark, 0x3F. – Igor Tandetnik Jun 08 '20 at 14:03

1 Answer

Do not confuse ț with ţ. The former is what is in your post, the latter is what's actually supported by Windows-1250.

The character ț from your post is T-comma, U+021B, LATIN SMALL LETTER T WITH COMMA BELOW, however:

> This letter was not part of the early Unicode versions, which is why Ţ (T-cedilla, available from version 1.1.0, June 1993) is often used in digital texts in Romanian.

The character referred to is ţ, U+0163, LATIN SMALL LETTER T WITH CEDILLA (emphasis mine):

> In early versions of Unicode, the Romanian letter Ț (T-comma) was considered a glyph variant of Ţ, and therefore was not present in the Unicode Standard. It is also not present in the Windows-1250 (Central Europe) code page.

The story of ş and ș (S-cedilla and S-comma, respectively) is analogous.

If you must encode to this archaic Windows-1250 code page, I'd suggest replacing the comma variants with the cedilla variants (both lowercase and uppercase) before encoding. I think Romanians will understand :)
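The replacement suggested above could look roughly like the following. This is only a minimal sketch in plain C++11 on UTF-16 text, not Qt code; `commaToCedilla` is a hypothetical helper name, and the mapping covers just the four Romanian comma-below letters.

```cpp
#include <string>

// Hypothetical helper (not part of Qt or the standard library):
// map the Romanian comma-below letters to their cedilla counterparts,
// which do exist in Windows-1250.
std::u16string commaToCedilla(std::u16string text)
{
    for (char16_t &ch : text) {
        switch (ch) {
        case 0x0218: ch = 0x015E; break; // S with comma below -> S with cedilla
        case 0x0219: ch = 0x015F; break; // s with comma below -> s with cedilla
        case 0x021A: ch = 0x0162; break; // T with comma below -> T with cedilla
        case 0x021B: ch = 0x0163; break; // t with comma below -> t with cedilla
        default: break;                  // everything else is left untouched
        }
    }
    return text;
}
```

Since `QString` stores UTF-16 as well, the same four substitutions should be expressible with `QString::replace(QChar(0x021B), QChar(0x0163))` and so on, applied to each string before it is passed to `codec->fromUnicode()` or written to the `QTextStream`.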

Thomas
  • Thank you for your help, I will check the appropriate letters to use with the people who translate the data. – avf Jun 08 '20 at 15:08