I have code like this:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
from PyQt4 import QtGui, QtCore
import re
.....
str = self.lineEdit.text()  # lineEdit is an instance of QtGui.QLineEdit
# This line thanks to Fedor Gogolev et al from
#https://stackoverflow.com/questions/12214801/print-a-string-as-hex-bytes
print('\\u'+"\\u".join("{:x}".format(ord(c)) for c in str))
# u+20000-u+2a6d6 is CJK Ext B
cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$",re.UNICODE)
if cjk.match(str):
    print("OK")
else:
    print("error")
When I input "敏感詞" (0x654F, 0x611F, 0x8A5E in UTF-16, respectively), the result was:
\u654f\u611f\u8a5e
OK
But when I input "詞" followed by three characters from the CJK Extension B area (U+20037, U+20081, U+2004D; the whole string is 0x8A5E, 0xD840 0xDC37, 0xD840 0xDC81, 0xD840 0xDC4D in UTF-16), the result, which is not what I expected, was:
\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d
error
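The failure can be reproduced without Qt. This is only a minimal sketch, assuming the text really reaches Python as seven separate characters, one per UTF-16 code unit, as the printed escapes suggest. The names `from_widget` and `has_surrogates` are made up for illustration.

```python
import re

# Same pattern as in the code above.
cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$", re.UNICODE)

# Assumption: one character per UTF-16 code unit, as in the printed output.
from_widget = "\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d"

def has_surrogates(text):
    """Return True if text contains code points in the surrogate range."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

print(len(from_widget))             # 7 characters instead of 4
print(has_surrogates(from_widget))  # True
print(cjk.match(from_widget))       # None, so the script prints "error"
```

The surrogate range U+D800-U+DFFF is not covered by any of the three ranges in the character class, which would explain the "error" branch.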
How can I process these CJK characters, for example by converting them to UTF-8, so that Python 3's re module handles them correctly?
P.S.
The value of sys.maxunicode is 1114111, so this should be a UCS-4 build. Hence, I think this question is not the same as "python regex fails to match a specific Unicode > 2 hex values".
Another piece of code:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
import re
CJKBlock = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$")  # CJK ext B
# "詞" followed by U+20037, U+20081, U+2004D
print(CJKBlock.search('詞\U00020037\U00020081\U0002004D'))
This returns <_sre.SRE_Match object; span=(0, 4), match='詞'>, which is the expected result.
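For comparison, the string literal has four code points, while the text printed from the line edit had seven. One possible way to bring the two into the same form, assuming the widget text holds each surrogate as its own character, is a round trip through UTF-16 with the `surrogatepass` error handler. The helper `join_surrogates` and the variable names are hypothetical, and this is a sketch rather than a confirmed fix for PyQt.

```python
import re

cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$", re.UNICODE)

def join_surrogates(text):
    """Recombine UTF-16 surrogate pairs stored as separate characters.

    'surrogatepass' lets the surrogates through the encoder; decoding the
    resulting UTF-16 bytes then merges each pair into one code point.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

# The literal as typed in source code (4 code points).
literal = "\u8a5e\U00020037\U00020081\U0002004d"
# Assumed form of the text coming from the line edit (7 code units).
from_widget = "\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d"

fixed = join_surrogates(from_widget)
print(len(literal), len(from_widget), len(fixed))  # 4 7 4
print(fixed == literal)                            # True
print(bool(cjk.match(fixed)))                      # True
```

Support for `surrogatepass` in the UTF-16 codec was added in Python 3.4, so this sketch would not run on earlier 3.x releases.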
Even when I added
self.lineEdit.setText("詞\U00020037\U00020081\U0002004D")
inside the __init__ method of the window class and ran it, the text in the line edit was displayed correctly, but when I pressed Enter, the result was still "error".
Versions:
- Python version: 3.4.3
- Qt version: 4.8.6
- PyQt version: 4.10.4