Remove literal invalid unicode characters in a string

Question

I have a string that was decoded as UTF-8 but still contains invalid Unicode characters (the lone surrogates \udb82 and \udc55).

string = '칼 마르크스 ｢자본론\udb82\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 ｢자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

Is there a way to remove any such literal Unicode character using a regex?

I need to remove those literal Unicode characters, not decode them into another form.

I can remove them only if I spell out each literal character in full; I have not found a pattern that matches any of them.

re.sub('\udb82', '', string )

'칼 마르크스 ｢자본론\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 ｢자본론\udc55의 권수와 쪽수만 표기함―역자'

I know it is possible to replace the literal Unicode characters by using encode and decode, but I am looking for alternatives that remove any such character directly.

string.encode('utf-8', 'replace').decode('utf-8')

'칼 마르크스 ｢자본론??Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 ｢자본론??의 권수와 쪽수만 표기함―역자'

The marked question does not solve my problem. I do not want to decode it into another form, I need to remove those literal unicode from the string. — cylim, Jul 06 '20 at 04:05
Then please refer to [this thread](https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode). — metatoaster, Jul 06 '20 at 04:32
Thanks @Jan. What you suggested is exactly what I need. Do you know how to get that to work with `re.sub`? I tried `re.sub(r'\\u\w+', '', string)`, but it didn't work. — cylim, Jul 06 '20 at 05:02
Thanks @Mandy8055. Your answer doesn't work in Python 3.6. It does not replace the literal unicode. — cylim, Jul 06 '20 at 05:18
It is working. Please see [here.](https://onlinegdb.com/SkSVrExkv) The code will work now for both python 2 and 3. Notice the shebang line at the top for python 2 compatibility. — , Jul 06 '20 at 05:20
It works when the literal unicode are properly escaped (\\u instead of \u). I found an alternative solution here that works on unescaped literal unicode. https://stackoverflow.com/questions/38681921/python-re-sub-and-unicode — cylim, Jul 06 '20 at 05:26

score 1 · Accepted Answer · answered Jul 06 '20 at 05:42

You may not need regular expressions at all. Encode with the 'ignore' error handler, which drops the characters that cannot be encoded, and decode again:

string = '칼 마르크스 ｢자본론\udb82\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 ｢자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

print(string.encode('utf-8', 'ignore').decode('utf-8'))

This yields:

칼 마르크스 ｢자본론Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 ｢자본론의 권수와 쪽수만 표기함―역자
#            ^^^ - it's gone!
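
If a regex is still preferred, as the question asks, a character class covering the surrogate range U+D800 to U+DFFF should match any of these characters in one pass. This is only a sketch, assuming Python 3 and that the invalid characters are always lone surrogates; the names SURROGATES and strip_surrogates are made up for illustration, and the sample string is a shortened version of the one in the question.

```python
import re

# Lone surrogates occupy the code point range U+D800 to U+DFFF.
# SURROGATES and strip_surrogates are hypothetical names for this sketch.
SURROGATES = re.compile(r'[\ud800-\udfff]')

def strip_surrogates(text):
    """Return text with every lone surrogate code point removed."""
    return SURROGATES.sub('', text)

# Shortened sample based on the string in the question.
string = '칼 마르크스 ｢자본론\udb82\udc55Ⅰ, 김수행 역'
cleaned = strip_surrogates(string)

# For this input the regex gives the same result as the
# encode/decode round trip with the 'ignore' error handler.
assert cleaned == string.encode('utf-8', 'ignore').decode('utf-8')

# Report how many code points were dropped.
print(len(string) - len(cleaned))
```

The attempt `re.sub('\udb82', '', string)` from the question removes only one of the two code points because `\udb82` and `\udc55` are stored as separate characters in the string; the range in the character class covers both.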
