
I want to read and write unicode characters from a PyQt5 PlainTextEdit.

I ran into a strange issue that only came to light after some experimenting:

If I enter the String:

yóuxiāngdìzhǐ

into the PlainTextEdit and use the method (by clicking on a button):

userInput = self.rightTextEdit.toPlainText()

it gives me the String:

yóuxingdìzhÐ

Which is obviously messed up. However, if I only change the first ó into an o it suddenly doesn't have a problem anymore:

input: youxiāngdìzhǐ
after method call: youxiāngdìzhǐ

So I guess Qt5 tries some magic behind the scenes and fails to guess the encoding (why does it try to guess anyway, wouldn't it be better to require the developer to choose an encoding?). Maybe it only reads some of the characters, or maybe it thinks the ó is such an unusual character that the encoding needs to be changed completely.

Since PyQt5 doesn't have any of the QString methods anymore, how am I supposed to tell a PlainTextEdit that I want the whole thing interpreted as a unicode String?

I read this question: "Set Qt default encoding to UTF-8", but the accepted answer only solves the problem for Qt4, while Qt5 doesn't have those methods anymore.

Here are the important parts of my source code:

from PyQt5.QtCore import *
from PyQt5.QtWidgets import *
...

class PinyinTransformerMainWindow(QMainWindow):

    def createControls(self):
        ...
        self.rightTextEdit = QPlainTextEdit('', self)
        self.rightTransformButton = QPushButton('Transform (numbers)')
        ...

    def addControlsEventHandlers(self):
        self.leftTransformButton.clicked.connect(self.transformToPinyinWithTones)
        self.rightTransformButton.clicked.connect(self.transformToPinyinWithNumbers)

    def transformToPinyinWithNumbers(self):
        userInput = self.rightTextEdit.toPlainText()
        print("User input right:", userInput)
        ...

EDIT #1:

I've written tests like this:

tonedText = "yóuxiāngdìzhǐ"
numberedText = "you2xiang1di4zhi3"
self.assertEqual(self.pinyin_tones_2_numbers_transformer.transform(tonedText), numberedText)

This test uses the transform method, which is the same method I call in the function to which the button click is connected in the PyQt5 GUI, and it runs without failing. This means the error must be in the GUI, where I get the String from the PlainTextEdit.

When I enter in a python console:

>>> a = "yóuxiāngdìzhǐ".encode(encoding="utf-8")
>>> a
b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90'
>>> a.decode()
'yóuxiāngdìzhǐ'
>>> a.decode(encoding="utf-8")
'yóuxiāngdìzhǐ'

So it's not a Python 3 problem. However, if I do this in the code:

self.leftTextEdit.toPlainText().encode('utf-8').decode('utf-8')

I get the wrong String:

yóuxingdìzhÐ

EDIT #2:

I've now added another print() like this:

print("Condition:", self.leftTextEdit.toPlainText().encode('utf-8').decode('utf-8') == "yóuxiāngdìzhǐ")

and then entered

yóuxiāngdìzhǐ

in the PlainTextEdit. This results in:

False

(!) So it really seems like there is an error in the Qt5 interpretation of the String in the PlainTextEdit. What can I do about it?

EDIT #3:

Python version: 3.4, PyQt5 version: 5.2.1, locale used: ('en_US', 'UTF-8')

Zelphir Kaltstahl
  • Please show the specific versions of Qt5, PyQt5, and Python you are using, and also state which platform you are on. It might also be worth testing with Qt4/PyQt4 to see if you can reproduce the problem. In particular, see if using `QString` in PyQt4 makes a difference (see [here](http://pyqt.sourceforge.net/Docs/PyQt4/incompatible_apis.html) for how to do this if you are using Python 3). – ekhumoro Dec 27 '14 at 18:22
  • PS: If you do `self.leftTextEdit.setPlainText('yóuxiāngdìzhǐ')`, does this display the text correctly? – ekhumoro Dec 27 '14 at 18:41
  • Python 3.4. In PyQt5 I don't know how to get the exact version, but I downloaded it only a few days ago, so it's probably up to date. I am using a Xubuntu derivative called Voyager, and it's 64-bit. About setting the text: yes, that works fine. – Zelphir Kaltstahl Dec 28 '14 at 21:44
  • You can get the exact versions with `QtCore.PYQT_VERSION_STR` and `QtCore.QT_VERSION_STR`. It seems most likely that this is a PyQt5 bug (it's almost certainly not a Qt bug). There have been a number of issues regarding QString to Python unicode object conversion (but not vice versa), and this seems to tally well with what you are seeing. It would help if you could confirm that you don't see similar problems when using PyQt4, though. And it would also be useful to know what your locale settings are. – ekhumoro Dec 29 '14 at 00:21
  • Ah thanks, I am using version `5.2.1` And here I thought I'd do something good by using the newest version there is and learning to use that : / Compiling all that stuff and investing time to learn about how to meet all the necessary requirements for compiling PyQt5 … After doing this: `import locale` and this `locale.getlocale()` I get `('en_US', 'UTF-8')`, which I expected since my system is set to English. – Zelphir Kaltstahl Dec 29 '14 at 00:44
  • I only have one class for one window, so my GUI is not a lot of code, and I managed to write the whole thing in GTK+3 (every string is considered unicode and it works), but the resizing seems to be handled in a strange way and there was a lot of trial and error before I got the expanding of components right. In PyQt5 it seemed easier, and there is also that nice concept of spacers. They probably save a lot of alignment work when there is more GUI code. It would be nice to get it working using PyQt5. In PyQt4 I'll have to use some QString stuff to get it working, I assume. – Zelphir Kaltstahl Dec 29 '14 at 00:49
  • That version of PyQt5 is quite old - the latest version is 5.4, but you'd also need Qt-5.4 for that. However, if you can upgrade to at least [PyQt-5.3.2](http://sourceforge.net/projects/pyqt/files/PyQt5/PyQt-5.3.2/), I'm fairly confident that will solve your problem. – ekhumoro Dec 29 '14 at 01:02
  • Thanks for all your help! Might try that soon. – Zelphir Kaltstahl Dec 29 '14 at 02:08

1 Answer

UPDATE:

It's very likely that your problem is actually due to a bug in the version of PyQt5 you are using. Upgrading to at least PyQt-5.3.2 will very likely fix it.
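To check whether an installation is older than the suggested version, something like the following could be used. This is only a sketch: `parse_version`, `is_at_least`, and `MINIMUM_PYQT` are names made up for illustration, and the parsing assumes a purely numeric dotted version string such as the one `QtCore.PYQT_VERSION_STR` reports.

```python
def parse_version(version_str):
    """Turn a dotted version string such as '5.2.1' into a tuple of ints.

    Assumes every component is numeric (hypothetical helper, not part of PyQt).
    """
    return tuple(int(part) for part in version_str.split("."))


def is_at_least(version_str, minimum):
    """Return True if version_str is the same as or newer than minimum."""
    return parse_version(version_str) >= minimum


# 5.3.2 is the minimum version suggested above.
MINIMUM_PYQT = (5, 3, 2)

print(is_at_least("5.2.1", MINIMUM_PYQT))  # version reported in the question: False
print(is_at_least("5.4", MINIMUM_PYQT))    # True

# In a real program the string would come from PyQt itself:
#   from PyQt5 import QtCore
#   is_at_least(QtCore.PYQT_VERSION_STR, MINIMUM_PYQT)
```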


There is no problem in Qt, which handles everything correctly.

You can easily verify this for yourself in an interactive session:

>>> from PyQt5 import QtWidgets
>>> app = QtWidgets.QApplication([''])
>>> w = QtWidgets.QPlainTextEdit()
>>> s = 'yóuxiāngdìzhǐ'
>>> w.setPlainText(s)
>>> w.toPlainText().encode('utf-8')
b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90'
>>> s.encode('utf-8')
b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90'
>>> w.toPlainText().encode('utf-8') == s.encode('utf-8')
True
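If the comparison comes out as False on your machine, as it did with the bytes reported in the comments below for PyQt5 5.2.1, it helps to see exactly which characters differ. The sketch below needs no PyQt at all: it takes the expected bytes from the session above and the bytes reported in the comments, and compares them code point by code point. `diff_code_points` is a helper name made up for this example. The last check is only an inference from those bytes, not a confirmed description of the bug: the mangled text happens to match the original with every code point cut down to its low 8 bits.

```python
# Expected bytes (from the session above) and the bytes reported in the
# comments for PyQt5 5.2.1, both decoded back to str.
expected = b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90'.decode('utf-8')
observed = b'y\xc3\xb3uxi\x01ngd\xc3\xaczh\xc3\x90'.decode('utf-8')


def diff_code_points(a, b):
    """Return (index, code point in a, code point in b) for each position
    where two equal-length strings differ."""
    return [(i, ord(x), ord(y)) for i, (x, y) in enumerate(zip(a, b)) if x != y]


for index, want, got in diff_code_points(expected, observed):
    print("index %d: expected U+%04X, got U+%04X" % (index, want, got))
# index 5: expected U+0101, got U+0001
# index 12: expected U+01D0, got U+00D0

# Keeping only the low 8 bits of every code point reproduces the mangled text.
low_bytes_only = ''.join(chr(ord(ch) & 0xFF) for ch in expected)
print(low_bytes_only == observed)  # True
```

Only ā and ǐ, the two characters above U+00FF, are affected, while ó and ì survive. That would fit a conversion that treats the string as 8-bit once it has seen ó, but this is a guess from the data.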

The only real problem may occur when you attempt to print the text:

>>> print(s)
yóuxiāngdìzhǐ

This gives the expected output for me, because the stdout encoding matches my console's encoding, and also my console's font contains all the necessary characters. But if your program is attempting to print to a console that hasn't been configured properly (or which just can't handle unicode very well), then you will very likely see mangled output of one kind or another.
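To rule out the console as the source of the mangling, the text can be inspected in a form that does not depend on the terminal's encoding or font. A minimal sketch using only the standard library, with the string from the question standing in for the value returned by `toPlainText()`:

```python
import sys

s = 'yóuxiāngdìzhǐ'

# The encoding Python uses when print() writes to this console.
print(sys.stdout.encoding)

# ascii() escapes every non-ASCII character, so the output is readable
# on any console, however it is configured.
print(ascii(s))
# 'y\xf3uxi\u0101ngd\xeczh\u01d0'

# The raw code points, for comparing against another string.
print([hex(ord(ch)) for ch in s])
```

If the escaped form is already wrong, the problem lies in the string itself and not in the console.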

ekhumoro
  • Ah thanks, so you are using `.encode('utf-8')` to do that … But it doesn't handle it correctly unless I call that function, right? However, now Python's internal String handling seems to be different. It now uses this hexadecimal String `b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90'` instead of the characters for which it actually stands. How can I get a normal unicode String back from that? I see there is the module binascii with unhexlify, but I don't have ASCII input, I have unicode input. – Zelphir Kaltstahl Dec 27 '14 at 09:43
  • I also found this post: http://stackoverflow.com/questions/6773270/python-convert-unicode-hex-string-to-unicode But the "solving" answer doesn't give me the not-hexadecimal string I want. – Zelphir Kaltstahl Dec 27 '14 at 09:44
  • Updated my post with relevant tests. – Zelphir Kaltstahl Dec 27 '14 at 10:27
  • Also I tried the exact same in the python console and it does not give me `True` but `False`. When I enter those lines you posted, I already get different Strings when I enter both of the encode('utf-8'). `>>> from PyQt5 import QtWidgets >>> app = QtWidgets.QApplication(['']) >>> w = QtWidgets.QPlainTextEdit() >>> s = 'yóuxiāngdìzhǐ' >>> w.setPlainText(s) >>> w.toPlainText().encode('utf-8') b'y\xc3\xb3uxi\x01ngd\xc3\xaczh\xc3\x90' >>> s.encode('utf-8') b'y\xc3\xb3uxi\xc4\x81ngd\xc3\xaczh\xc7\x90' >>> w.toPlainText().encode('utf-8') == s.encode('utf-8') False >>> ` – Zelphir Kaltstahl Dec 27 '14 at 17:29