How to convert CJK Extension B text in a QLineEdit (Python3-PyQt4) to UTF-8 so it can be processed with regex

Question

I have code like this:

#!/usr/bin/env python3
#-*-coding:utf-8-*-
from PyQt4 import QtGui, QtCore
import re
.....
str = self.lineEdit.text() # lineEdit is a QtGui.QLineEdit object

# This line thanks to Fedor Gogolev et al from 
#https://stackoverflow.com/questions/12214801/print-a-string-as-hex-bytes

print('\\u'+"\\u".join("{:x}".format(ord(c)) for c in str))
# u+20000-u+2a6d6 is CJK Ext B
cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$",re.UNICODE) 

if cjk.match(str):
    print("OK")
else:
    print("error")

When I input "敏感詞" (0x654F, 0x611F, 0x8A5E in UTF-16, respectively), the result is:

\u654f\u611f\u8a5e
OK

But when I input "詞𠀷𠂁𠁍" (0x8A5E, 0xD840 0xDC37, 0xD840 0xDC81, 0xD840 0xDC4D in UTF-16), which contains three characters from the CJK Extension B block, the result is not what I expected:

\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d
error

How can I convert these CJK characters to UTF-8 so that they can be processed properly with Python 3's re module?

P.S.

The value of sys.maxunicode is 1114111, so the build is probably UCS-4. Hence, I don't think this question is the same as "python regex fails to match a specific Unicode > 2 hex values".
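
The point in this P.S. can be checked directly. The following is a minimal sketch, not part of the original program: it assumes Python 3.3 or later, and the names `cjk`, `good`, and `broken` are made up for illustration. It suggests that `sys.maxunicode` can be 1114111 while a string still holds surrogate code points, which fall outside every range in the pattern.

```python
import re
import sys

# Same ranges as in the question, written with escapes
# (U+4E00-U+9FCC, U+3400-U+4DB5, U+20000-U+2A6D6).
cjk = re.compile("^[\u4e00-\u9fcc\u3400-\u4db5\U00020000-\U0002A6D6]+$")

# One real Extension B code point versus the surrogate pair
# that showed up in the question's hex dump (illustrative values).
good = "\U00020037"
broken = "\ud840\udc37"

print(sys.maxunicode)            # 1114111 on Python 3.3+
print(len(good), len(broken))    # 1 2
print(bool(cjk.match(good)))     # True
print(bool(cjk.match(broken)))   # False: U+D840 and U+DC37 are in no range
```

So a full-range build does not by itself rule out surrogates in a string handed over by another library.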

Another piece of code:

#!/usr/bin/env python3
#-*-coding:utf-8-*-
import re
CJKBlock = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$") #CJK ext B
print(CJKBlock.search('詞𠀷𠂁𠁍'))

returns <_sre.SRE_Match object; span=(0, 4), match='詞𠀷𠂁𠁍'>, which is the expected result.

Even when I added self.lineEdit.setText("詞𠀷𠂁𠁍") inside the __init__ function of the window class and ran it, the text in the line edit displayed correctly, but when I pressed Enter, the result was still "error".

Versions:
- Python 3.4.3
- Qt 4.8.6
- PyQt 4.10.4

Why are you using a narrow-build of python-3.4? Since [PEP-0393](http://www.python.org/dev/peps/pep-0393/), there is no longer any advantage in doing that. Your code fails because the non-BMP characters have to be represented as surrogate pairs. If you switch to a wide-build of python-3.4, this problem will go away. — ekhumoro, Apr 02 '16 at 16:48
@ekhumoro The value returned by sys.maxunicode is 1114111, so it is probably UCS-4. — Tan Kian-teng, Apr 02 '16 at 17:36
#!/usr/bin/env python3 #-*-encoding:utf-8-*- import re; CJKBlock = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$"); #CJK ext B print(CJKBlock.search('詞𠀷𠂁𠁍')); returns <_sre.SRE_Match object; span=(0, 4), match='詞𠀷𠂁𠁍'>, the expected result, but the string from QLineEdit fails to match. Maybe it's a problem in Python3-PyQt4? — Tan Kian-teng, Apr 02 '16 at 17:42
I cannot reproduce this on a wide-build: it prints `\u8a5e\u20037\u20081\u2004d` as expected (i.e. no surrogate pairs). What **specific** version of pyqt4 are you using, and where did you get it from? How are you entering the text into the line-edit? Does it make any difference if you enter it in code, using `setText()`? — ekhumoro, Apr 02 '16 at 18:01
[1] Qt version: 4.8.6, PyQt version: 4.10.4. [2] I got it from the Linux Mint 17.3 64-bit repo. [3] I key the text into the lineEdit, with `self.lineEdit.returnPressed.connect(the_function)`. [4] Even when I added self.lineEdit.setText("詞𠀷𠂁𠁍") in the __init__ function of the window and ran it, the text in the lineEdit displayed correctly, but when I pressed Enter, the result was still "error". — Tan Kian-teng, Apr 02 '16 at 18:33

score 0 · Accepted Answer · answered Apr 02 '16 at 19:22

There were a few PyQt4 bugs following the implementation of PEP-393 that can affect conversions between QString and python strings. If you use sip to switch to the v1 API, you should probably be able to confirm that the QString returned by the line-edit does not contain surrogate pairs. But if you then convert it to a python string, the surrogates should appear.

Here is how to test this in an interactive session:

>>> import sip
>>> sip.setapi('QString', 1)
>>> from PyQt4 import QtGui
>>> app = QtGui.QApplication([])
>>> w = QtGui.QLineEdit()
>>> w.setText('詞𠀷𠂁𠁍')
>>> qstr = w.text()
>>> qstr
PyQt4.QtCore.QString('詞𠀷𠂁𠁍')
>>> pystr = str(qstr)
>>> print('\\u' + '\\u'.join('{:x}'.format(ord(c)) for c in pystr))
\u8a5e\u20037\u20081\u2004d

Of course, this last line does not show surrogates for me, because I cannot do the test with PyQt-4.10.4. I have tested with PyQt-4.11.1 and PyQt-4.11.4, though, and I did not see any problems. So you should try to upgrade to one of those.
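
If upgrading is not possible, one workaround is to re-pair the surrogates on the Python side before matching. This is only a minimal sketch under stated assumptions: it assumes Python 3.4 or later (where the `surrogatepass` error handler is available for UTF-16) and that the string contains well-formed surrogate pairs, as in the question's output. `fix_surrogates` is a hypothetical helper name, not a PyQt function.

```python
import re

def fix_surrogates(text):
    """Merge UTF-16 surrogate pairs in a str into single code points."""
    # 'surrogatepass' lets the surrogates through the encoder; decoding the
    # resulting UTF-16 bytes then combines each pair into one character.
    # An unpaired surrogate would make the decode step raise UnicodeDecodeError.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

# Same ranges as in the question, written with escapes.
cjk = re.compile("^[\u4e00-\u9fcc\u3400-\u4db5\U00020000-\U0002A6D6]+$")

# The code units reported in the question's hex dump.
broken = "\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d"
fixed = fix_surrogates(broken)

print(len(broken), len(fixed))   # 7 4
print(bool(cjk.match(broken)))   # False
print(bool(cjk.match(fixed)))    # True
```

In the question's code this would amount to wrapping the value returned by `self.lineEdit.text()` in such a helper before calling `cjk.match()`.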

After I updated to python3-pyqt4 4.11.4 and python3-sip 4.16.9, the problem was solved. Thank you very much. — Tan Kian-teng, Apr 02 '16 at 19:45
