
I get a string that contains Unicode escape sequences, but the backslashes are escaped. I want to remove one backslash from each pair so that Python treats the escapes as Unicode characters.

Using replace, I am only able to remove or add two backslashes at a time.

my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\', '')

'\\uD83D\\uDE01\\n\\uD83D\\uDE01' should be '\uD83D\uDE01\n\uD83D\uDE01'

Edit: Thank you for the many responses. You are right, my example was wrong. Here are other things I have tried:

my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\\\', '\\') # no unicode
my_str2 = my_str.replace('\\', '')
HennyKo
  • To my knowledge, this will not work. Writing unicode characters like this will only work in string literals. These are already strings and therefore you might need to do some code execution on the strings in question if you want these to be transformed into unicode characters. If you do that, be careful - this can end up executing arbitrary code. – Kendas Apr 24 '19 at 06:56
  • Try `print(my_str)` to see if the backslash is escaped or not. Probably shouldn't be. – lahsuk Apr 24 '19 at 06:57
  • What is your end goal with the `\uD83D\uDE01\n\uD83D\uDE01` output? – Devesh Kumar Singh Apr 24 '19 at 06:58

1 Answer


That's… probably not going to work. Escape sequences are handled during lexical analysis (parsing) of string literals. What you have in your string is already a single backslash; the doubled backslash is just the escaped representation of that single backslash:

>>> r'\u3d5f'
'\\u3d5f'
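
To make the point above concrete, here is a small sketch (the variable name and the shortened sample value are only for illustration) showing that the doubled backslashes exist only in the literal and in the repr, not in the string itself:

```python
# The literal is written with doubled backslashes, but the resulting
# string object holds a single backslash before each 'u'.
my_str = '\\uD83D\\uDE01'

print(len(my_str))         # 12: two escape sequences of 6 characters each
print(my_str)              # shows single backslashes
print(repr(my_str))        # shows doubled backslashes (escaped representation)
print(my_str.count('\\'))  # 2: one real backslash per escape sequence
```

So there is no extra backslash to strip with replace; the six characters of each sequence have to be interpreted as an escape instead.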

What you need to do is encode the string to bytes (as if it were "python source"), then decode it again with the unicode_escape codec so the escapes are applied:

>>> my_str.encode('utf-8').decode('unicode_escape')
'\ud83d\ude01\n\ud83d\ude01'

Note, however, that these code points are surrogates, so the resulting string is pretty much broken / invalid. You are not going to be able to print it, for example, because the UTF-8 encoder will reject it:

>>> print(my_str.encode('utf-8').decode('unicode_escape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

To fix that, you need a second fixup pass: encode to UTF-16, letting the surrogates pass through directly (using the "surrogatepass" error handler), then do a proper UTF-16 decode to get an actual well-formed string:

>>> print(my_str.encode('utf-8').decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16'))
😁
😁

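The two passes above could be wrapped up roughly as follows. This is only a sketch: decode_escaped is a hypothetical helper name, and it assumes the input is pure ASCII (the unicode_escape codec reads bytes as Latin-1, so non-ASCII input would be mangled).

```python
def decode_escaped(text: str) -> str:
    """Interpret literal \\uXXXX sequences and join surrogate pairs.

    Hypothetical helper for illustration; not a library function.
    """
    # Pass 1: apply the backslash escapes. encode('ascii') raises on
    # non-ASCII input instead of silently producing mojibake.
    unescaped = text.encode('ascii').decode('unicode_escape')
    # Pass 2: round-trip through UTF-16 so each high/low surrogate pair
    # is combined into a single code point.
    return unescaped.encode('utf-16', 'surrogatepass').decode('utf-16')


my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
result = decode_escaped(my_str)
print(result == '\U0001F601\n\U0001F601')  # True
```

Using 'ascii' rather than 'utf-8' for the first encode is a deliberate choice in this sketch; both behave the same for the sample string.
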
You may really want to trace where your data comes from, though. It's not logically valid to get a (Unicode) string with Unicode escapes in it; the cause might be incorrect loading of JSON data or some such. If it's an option (I realise that's not always the case), fixing that would be much better than applying hacky fixups afterwards.
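
As an illustration of the JSON case only (an assumption; the actual origin of the data may be different), the standard json module already interprets \uXXXX escapes and combines surrogate pairs, so no fixup pass is needed when the raw value is parsed as JSON. The payload below is a made-up example:

```python
import json

# Hypothetical raw payload: JSON text that contains \uXXXX escapes.
raw = '{"text": "\\uD83D\\uDE01\\n\\uD83D\\uDE01"}'

data = json.loads(raw)
# The surrogate pair is decoded to a single code point, U+1F601.
print(data['text'] == '\U0001F601\n\U0001F601')  # True
```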

Masklinn
  • Thank you very much for the explanation. I am getting my data from an SQLite DB where the escape char is not escaped again. It looks like the sqlite3 is doing the escaping. – HennyKo Apr 24 '19 at 07:12