21

I'm trying to create a UTF-8 encoded file in Qt.

#include <QtCore>

int main()
{
    QString unicodeString = "Some Unicode string";
    QFile fileOut("D:\\Temp\\qt_unicode.txt");
    if (!fileOut.open(QIODevice::WriteOnly | QIODevice::Text))
    {
        return -1;
    }

    QTextStream streamFileOut(&fileOut);
    streamFileOut.setCodec("UTF-8");
    streamFileOut << unicodeString;
    streamFileOut.flush();

    fileOut.close();

    return 0;
}

I thought that, since QString is Unicode by default, setting the codec of the output stream to UTF-8 would give me a UTF-8 file. But it doesn't; the file is ANSI. What am I doing wrong? Is something wrong with my strings? Can you correct my code so that it creates a UTF-8 file? My next step will be to read an ANSI file and save it as a UTF-8 file, so I'll have to convert each string I read, but for now I just want to start with a file. Thank you.

Martin Hennings
Ondrej Vencovsky
  • 1
    You should convert the string literal to a string with QString::fromUtf8(). Also, some compilers have problems with non-ascii encodings in source files (MSVC). So maybe also try if it works when entering the string via e.g. QInputDialog. I also suggest to define QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII when encountering issues like this. It disables implicit conversions and thus makes it clearer what's going on. – Frank Osterfeld Jan 24 '11 at 10:03
  • http://stackoverflow.com/questions/29485602/qt-convert-unicode-entites – trante Apr 07 '15 at 10:14

3 Answers

19

2022 edit: what follows was true for Qt 4. Qt 5 and later use UTF-8 by default, so this answer doesn’t apply to the latest Qt versions.

Your code is absolutely correct. The only part that looks suspicious to me is this:

QString unicodeString = "Some Unicode string";

The reason it looks suspicious is that QString uses the Latin1 encoding by default when constructing from a C-style string literal, so if you just intend to use accented Latin characters, you're probably fine, but use anything but that (Cyrillic, Chinese, Japanese, Hebrew...) and it no longer works correctly. The best way to deal with this issue is to have your source encoded in UTF-8 and do this instead:

QString unicodeString = QString::fromUtf8("Some Unicode string");

This will work for any imaginable language. Using QObject::trUtf8() is even better as it gives you a lot of i18n capabilities.
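To see why the choice of decoder matters, here is a minimal sketch in plain standard C++ (not Qt) that decodes the same bytes both ways. The helper names `decodeLatin1` and `decodeUtf8` are made up for this illustration, and the UTF-8 decoder assumes well-formed input and does no validation. It is only meant to mirror, roughly, the difference between `QString::fromLatin1()` and `QString::fromUtf8()`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one Latin-1 character.
// Each byte maps to the code point with the same numeric value.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: decode UTF-8 into code points.
// Assumes the input is well-formed; no validation is performed.
std::u32string decodeUtf8(const std::string &bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp = 0;
        int extra = 0; // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

For the two bytes `0xD0 0x96` (the Cyrillic letter Ж in a UTF-8 source file), `decodeUtf8` yields the single code point U+0416, while `decodeLatin1` yields two unrelated characters, U+00D0 and U+0096. For pure ASCII text such as "Hello" both decoders give the same result, which is why the problem only shows up with non-ASCII literals.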

Edit

While it's true that you generate a correct UTF-8 file, if you want Notepad to recognize your file as UTF-8, it's a different story. You need to put a BOM in there. It can be done either as suggested in another answer, or here is another way:

streamFileOut.setGenerateByteOrderMark(true);
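At the byte level, the UTF-8 BOM is just the three bytes `EF BB BF` at the very start of the file. The following is a rough sketch in plain standard C++ (not Qt) of what prepending and detecting it looks like; `kUtf8Bom`, `hasUtf8Bom` and `withUtf8Bom` are names invented for this example, not part of any library.

```cpp
#include <string>

// The UTF-8 encoding of U+FEFF, used as a byte order mark.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper: true if the data starts with the UTF-8 BOM.
bool hasUtf8Bom(const std::string &data)
{
    return data.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Hypothetical helper: return the text with a BOM in front,
// leaving it unchanged if it already has one.
std::string withUtf8Bom(const std::string &utf8Text)
{
    return hasUtf8Bom(utf8Text) ? utf8Text : kUtf8Bom + utf8Text;
}
```

With this, `withUtf8Bom("Hello")` is 8 bytes long: the 3 BOM bytes followed by the 5 ASCII bytes. Without the BOM, the 5 bytes of "Hello" are identical in ASCII, Latin-1 and UTF-8, so an editor has nothing to base its guess on.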
Sergei Tachenov
  • 1
    I wouldn't recommend keeping C++ source in UTF-8 :) – Piotr Dobrogost Jan 24 '11 at 10:14
  • 1
    @Piotr, why? UTF-8 (with no BOM) is an encoding that is perfectly compatible with US-ASCII and supports any language. How else can you use character literals in some native language, without resorting to QTextCodec::setCodecForCStrings() which can lead to a whole lot of problems? – Sergei Tachenov Jan 24 '11 at 10:19
  • @Sergey. I have to agree with Piotr. The problem is when you do have non-ASCII literals in the source file, it's up to the mercy of pre-processors and compilers to try not to mangle them. I have no doubt most modern tools can handle it. But why leave it to chance? – Stephen Chu Jan 24 '11 at 13:03
  • @Stephen, I agree that it may lead to problems, but in real life it doesn't, and what are the alternatives? If English is the main language of the program's interface and source code comments, it is possible to have sources in US-ASCII only. But what if it isn't? I develop software for Russian specialists in a team of Russian developers, some of whom don't even speak English fluently. What choice do I have? My point is, if non-ASCII characters are needed in sources, the best choice for encoding is UTF-8. – Sergei Tachenov Jan 24 '11 at 13:48
  • Well, the source code is not UTF-8, I also don't like that idea much. I just want to create UTF-8 file with "Hello" inside using Qt and I don't know how. In QString documentation I can read that QString str = "Hello"; creates Unicode string - but it seems not, it probably doesn't convert it from ANSI source to Unicode. I don't know Qt, I'm trying to get to know it, usually I program in .NET and there declaring OutputStream with UTF-8 as codec parameter is enough to get resulting UTF-8 file, despite what I send in. Now Qt creates a file, but when I open it in Notepad, it says it's ANSI. – Ondrej Vencovsky Jan 24 '11 at 14:48
  • @Ondrej, as I've said, your code is correct. If you just write "Hello", it is the same in both ANSI and UTF-8 so you (or Notepad) can't tell the difference, unless you also write a BOM as someone suggested already. – Sergei Tachenov Jan 24 '11 at 14:57
  • @Ondrej, also, `QString str = "Hello";` does convert from Latin1 (probably it's what ANSI on your system is) by default, unless you change it with QTextCodec::setCodecForCStrings(). – Sergei Tachenov Jan 24 '11 at 15:05
  • @Sergey: Aaaaah - BOM! That's exactly it. Thank you Sergey for your time. – Ondrej Vencovsky Jan 24 '11 at 16:08
  • 1
    @Ondrej, note that some software may not like the BOM, especially software that wasn't designed to support Unicode in the first place. And it's still valid UTF-8 even without it, so it's up to you whether to put it there or not. In the end it depends on how you are planning to use the generated files. – Sergei Tachenov Jan 24 '11 at 16:46
  • 1
    BOM has no meaning for UTF-8 files and is a Microsoft-ism. – koan Oct 18 '13 at 09:59
  • Downvote for unnecessarily condescending language: "You do realize, ... don't you?" I would hope *you* realise people come to SO to learn things we didn't know already. Please try to avoid using this kind of wording, it's not very nice. – CJBrew May 17 '22 at 20:43
  • @CJBrew, that’s fair. It wasn’t my intention, but even though English is not my native language, I think it was good enough in 2011 for me to notice this. Will try to reword in a more polite way. – Sergei Tachenov May 19 '22 at 09:33
11

In my experience, this is how to create a UTF-8 text file without a BOM in Qt:

file.open(QIODevice::WriteOnly | QIODevice::Text);
QTextStream out(&file);
out.setCodec("UTF-8"); // ...
vcfline = ctn; //assign some utf-8 characters
out.setGenerateByteOrderMark(false);
out << vcfline; //.....
file.close();

The resulting file will be encoded as UTF-8 without a BOM.

user2006121
7

Don't forget that UTF-8 encodes ASCII characters as a single byte each. Only non-ASCII characters, such as accented or other special characters, are encoded with more bytes (from 2 to 4 bytes).

This means that as long as you only have ASCII characters (which is the case for your unicodeString), the file will contain only single-byte characters. Thus, you get backward compatibility with ASCII:

UTF-8 can represent every character in the Unicode character set, but unlike them, possesses the advantages of being backward-compatible with ASCII
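As an illustration of those byte lengths, here is a small sketch of a UTF-8 encoder in plain standard C++ (not Qt). The function name `encodeUtf8` is invented for this example, and the sketch assumes the caller passes a valid code point; it does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid code point (surrogates are not rejected here).
std::string encodeUtf8(char32_t cp)
{
    assert(cp <= 0x10FFFF); // highest code point Unicode defines
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to the ASCII encoding.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encodeUtf8(U'A')` is the single byte `0x41`, exactly as in ASCII, while `encodeUtf8(0xE9)` (the letter é) is the two bytes `C3 A9` and `encodeUtf8(0x20AC)` (the euro sign) is the three bytes `E2 82 AC`.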

To check whether your code is working, you should put some accented characters, for instance, in your unicodeString.

I tested your code with accented characters, and it's working fine.

If you want to have a BOM at the beginning of your file, you could start by adding the BOM character (QChar(QChar::ByteOrderMark)).

lesenk
Jérôme
  • Thank you Jerome, you helped me with a BOM. File was really OK, but BOM was missing. I use the Sergey's way to add it to the stream, but your help is very appreciated. – Ondrej Vencovsky Jan 24 '11 at 16:09