
I need help with a Python regex. I have a string that contains both Chinese and English, and I would like to remove the whitespace between Chinese characters but not between English words.

from -- "u'\u5c0f \u5973 \u4eca \u5e74 \u4fc2 dse \u8003 \u751f \u5979 \u559c \u6b61 filmtv \u524d \u5e7e \u65e5 in \u5de6 buasso-filmtv and digital media studies \u5df2 \u7d93 condition offer \u4f46 \u60f3 \u554f \u5982 \u679c through jupas openu \u6536 \u5979 \u8b80 bachelor of arts with honours In creative writing and filmarts"

to -- "u'\u5c0f\u5973\u4eca\u5e74\u4fc2 dse \u8003\u751f\u5979\u559c\u6b61 filmtv \u524d\u5e7e\u65e5 in \u5de6 buasso-filmtv and digital media studies \u5df2\u7d93 condition offer \u4f46\u60f3\u554f\u5982\u679c through jupas openu \u6536\u5979\u8b80 bachelor of arts with honours In creative writing and filmarts"

The whitespace should only be removed when it is between two Unicode characters.

  • There doesn't seem to be a good built-in function to discover the Unicode block, or good support for Unicode in Python's re. I guess you should use the [regex](https://pypi.python.org/pypi/regex/) package for more specific handling of Unicode (in this case, you might want to use a Unicode script or Unicode block). Otherwise, you will have to list the Unicode blocks manually in the regex. – nhahtdh Mar 30 '17 at 13:52
  • Yes, Python's re functions do not distinguish between Chinese and English; everything is Unicode, so I can't simply search for Unicode characters. – Raphal Chen Mar 30 '17 at 14:37

1 Answer


If you're fine with defining "unicode characters" as "non-ASCII" characters then you can do this with negative lookahead/lookbehind:

import re
text = re.sub(r"(?<![ -~]) (?![ -~])", "", text)
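As a rough check, here is the same substitution applied to a shortened version of the sample string from the question. This is only a sketch: it assumes Python 3 (where plain string literals are already Unicode), and the names `NON_ASCII_GAP`, `text` and `result` are made up for illustration.

```python
import re

# Shortened version of the string from the question (Python 3 str literal).
text = ("\u5c0f \u5973 \u4eca \u5e74 \u4fc2 dse \u8003 \u751f filmtv "
        "\u524d \u5e7e \u65e5 in \u5de6 creative writing")

# A single space is removed only when neither neighbour is a printable
# ASCII character (the range [ -~] covers U+0020 through U+007E).
NON_ASCII_GAP = re.compile(r"(?<![ -~]) (?![ -~])")

result = NON_ASCII_GAP.sub("", text)
print(result)
```

Note that a run of two or more spaces between Chinese characters is left alone by this pattern, because each space in the run has another space (which is itself in `[ -~]`) as a neighbour.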

If you don't like the range used ([ -~]) then this question has some alternatives. Additionally, there are a variety of Unicode categories that might serve your purpose better, but as far as I can tell you'll still have to define the character range manually, as they're unsupported in the re module.
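One way to define the range manually is sketched below. It assumes that "Chinese character" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) only; extension blocks, CJK punctuation and full-width forms are not covered, so the range may need widening for real data. The names `HAN`, `HAN_GAP` and `join_han` are hypothetical.

```python
import re

# Assumption: "Chinese character" means the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). Extension blocks and CJK punctuation are not included.
HAN = "[\u4e00-\u9fff]"

# One or more spaces that sit directly between two characters in that range.
HAN_GAP = re.compile("(?<=" + HAN + ") +(?=" + HAN + ")")


def join_han(text):
    """Remove runs of spaces found directly between two Han characters."""
    return HAN_GAP.sub("", text)


print(join_han("\u5df2 \u7d93 condition offer \u4f46  \u60f3"))
```

Compared with the non-ASCII definition, this version leaves a space next to other non-ASCII text untouched (for example between "caf\u00e9" and a following Chinese character), and it also collapses runs of several spaces between two Chinese characters.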

Community
medwards