
I have a string that was decoded as UTF-8 but still contains invalid Unicode characters.

string = '칼 마르크스 「자본론\udb82\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

Is there a way to remove any literal unicode character using regex?

I need to remove those literal Unicode characters, not decode them into another form.


I can remove a character only if I spell out its exact escape in the pattern; I cannot find a pattern that matches any such character.

re.sub('\udb82', '', string)

'칼 마르크스 「자본론\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udc55의 권수와 쪽수만 표기함―역자'


I know it is possible to replace the literal Unicode characters by using encode and decode, but I am looking for alternatives that remove any such character directly.

string.encode('utf-8', 'replace').decode('utf-8')

'칼 마르크스 「자본론??Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론??의 권수와 쪽수만 표기함―역자'

cylim
  • The marked question does not solve my problem. I do not want to decode it into another form, I need to remove those literal unicode from the string. – cylim Jul 06 '20 at 04:05
  • Then please refer to [this thread](https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode). – metatoaster Jul 06 '20 at 04:32
  • Something like this - https://regex101.com/r/n7nRXq/1 ? – Jan Jul 06 '20 at 04:45
  • Thanks @Jan. What you suggested is exactly what I need. Do you know how to get that to work with `re.sub`? I tried `re.sub(r'\\u\w+', '', string)`, but it didn't work. – cylim Jul 06 '20 at 05:02
  • Thanks @Mandy8055. Your answer doesn't work in Python 3.6. It does not replace the literal unicode. – cylim Jul 06 '20 at 05:18
  • It is working. Please see [here.](https://onlinegdb.com/SkSVrExkv) The code will work now for both python 2 and 3. Notice the shebang line at the top for python 2 compatibility. –  Jul 06 '20 at 05:20
  • It works when the literal Unicode escapes are properly escaped (\\u instead of \u). I found an alternative solution here that works on unescaped literal Unicode characters: https://stackoverflow.com/questions/38681921/python-re-sub-and-unicode – cylim Jul 06 '20 at 05:26
  • I'm glad it worked. Cheers =) –  Jul 06 '20 at 05:29

1 Answer


You might not need regular expressions at all; you could simply use:

string = '칼 마르크스 「자본론\udb82\udc55Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

print(string.encode('utf-8', 'ignore').decode('utf-8'))

Which yields:

칼 마르크스 「자본론Ⅰ, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론의 권수와 쪽수만 표기함―역자
#            ^^^ - it's gone!
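
If a regex is still preferred, as the question asks, a character class covering the surrogate range is one option. This is a minimal sketch, assuming every invalid character in the string is a lone surrogate (U+D800 to U+DFFF), which is the case for \udb82 and \udc55; `strip_surrogates` is a hypothetical helper name, not something from the original post.

```python
import re

# Matches any single surrogate code point (U+D800 to U+DFFF).
# Assumption: the unwanted characters are all lone surrogates.
SURROGATES = re.compile(r'[\ud800-\udfff]')


def strip_surrogates(text):
    """Return text with every surrogate code point removed (hypothetical helper)."""
    return SURROGATES.sub('', text)


string = '칼 마르크스 「자본론\udb82\udc55Ⅰ, 김수행 역 비봉출판사 108쪽'
cleaned = strip_surrogates(string)

# Printing a str that contains surrogates can raise UnicodeEncodeError,
# so only the cleaned result is printed here.
print(cleaned)
```

Unlike the encode/decode round trip, this never leaves the str type, and the pattern can be extended if other code points also need to go.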
Jan