6

I have a string like so:

>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'

I made it using a function that converts Unicode text to the corresponding Python escape sequences. When I want to convert it back, I can't get rid of the double backslash so that it is interpreted as Unicode again. How can this be done?

>>> t = unicode_encode("
>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> print(t)
\u0048\u0065\u006c\u006c\u006f\u0020\u20ac\u0020\u00b0    
>>> t.replace('\\','X')
'Xu0048Xu0065Xu006cXu006cXu006fXu0020Xu20acXu0020Xu00b0'
>>> t.replace('\\', '\\')
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'

Of course, I can't do this, either:

>>> t.replace('\\', '\')
  File "<ipython-input-155-b46c447d6c3d>", line 1
    t.replace('\\', '\')
                         ^
SyntaxError: EOL while scanning string literal
narnie

3 Answers

9

Not sure if this is appropriate for your situation, but you could try using unicode_escape:

>>> t
'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> type(t)
<class 'str'>
>>> enc_t = t.encode('utf_8')
>>> enc_t
b'\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0'
>>> type(enc_t)
<class 'bytes'>
>>> dec_t = enc_t.decode('unicode_escape')
>>> type(dec_t)
<class 'str'>
>>> dec_t
'Hello € °'

Or in abbreviated form:

>>> t.encode('utf_8').decode('unicode_escape')
'Hello € °'

You take your string and encode it using UTF-8, and then decode it using unicode_escape.
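
One caveat, shown as a minimal sketch: unicode_escape reads any byte that is not part of an escape sequence as Latin-1, so the UTF-8 round trip garbles text that already contains real non-ASCII characters. The helper name `unescape` and the sample string `mixed` are made up for illustration; encoding with latin-1 and the backslashreplace error handler is one way to sidestep the problem.

```python
def unescape(text):
    # Characters outside Latin-1 are first turned into backslash escapes,
    # so every byte handed to unicode_escape is either an escape sequence
    # or a genuine Latin-1 byte.
    raw = text.encode("latin-1", errors="backslashreplace")
    return raw.decode("unicode_escape")

# The string from the question: literal backslash-u sequences.
t = "\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u20ac\\u0020\\u00b0"
print(unescape(t))  # Hello € °

# Hypothetical input mixing a real non-ASCII character with an escape.
mixed = "caf\u00e9 \\u20ac"
print(unescape(mixed))  # café €

# The UTF-8 round trip decodes the two UTF-8 bytes of é as two
# separate Latin-1 characters.
mangled = mixed.encode("utf_8").decode("unicode_escape")
print(mangled)
```

Either version still treats every backslash in the input as the start of an escape sequence, so this only suits text where that is intended.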

RocketDonkey
  • 1
    Thanks. I saw your earlier post and I tried it and realized that it needed converting into a binary object, which I did with bytes(t, 'utf8').decode('unicode_escape'), but I like how you did it above better. Thanks for pointing me in the right direction. Plus, I'll just use str.encode('unicode_escape') from now on to give me a binary to begin with. Thanks so much. – narnie Jan 22 '13 at 07:15
  • @narnie Ha, totally my bad - I did it in terms of Python 2.x then realized I should probably read more closely :) Good luck with everything! – RocketDonkey Jan 22 '13 at 07:16
  • No, you did me kindness by helping. I'm grateful. Thanks again. – narnie Jan 23 '13 at 04:50
0

Since a backslash is an escape character in string literals, searching for two backslashes means writing four in the search string and two in the replacement, i.e.:

t.replace("\\\\", "\\")

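This replaces every pair of literal backslashes with a single backslash. In a normal string literal, each "\\" stands for one backslash character, so "\\\\" is two backslashes and "\\" is one. The r prefix marks a raw string, which turns that escaping off: print(r"\\") prints \\, and evaluating r"\\" at the interactive prompt shows its repr, '\\\\'. (A raw string cannot end in a single backslash, so one backslash still has to be written as "\\".)

user1632861 suggested that you use .replace("\\", ""), but this replaces every single backslash with nothing. Try the above method instead. :D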

In this case, however, it appears as though you are reading/receiving data, and you probably want to use the correct encoding and then decode to unicode (as the person above me suggested).
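
A quick way to check how normal and raw literals relate, as a sketch (the variable names and the `sample` string are only for illustration):

```python
one = "\\"    # normal literal: a single backslash character
two = r"\\"   # raw literal: two backslash characters

print(len(one), len(two))  # 1 2
print(two)                 # \\
print(repr(two))           # '\\\\'

# Collapsing doubled backslashes only changes strings that really
# contain two backslashes in a row.
sample = r"a\\b"           # a, backslash, backslash, b
collapsed = sample.replace("\\\\", "\\")
print(collapsed)           # a\b
```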

dylnmc
-1

You only have one backslash at each position in your string; a backslash is just displayed as \\ when the interpreter echoes the string. As you can see, when you use print(), there's only one backslash. So if you want to get rid of one of the two backslashes, don't do anything: it's not there. If you want to get rid of the backslash entirely, remove that single one, again writing \\ to represent one backslash: t.replace("\\", "")

So your string never has two backslashes in the first place, and that shouldn't be the problem.
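
This can be verified directly. The sketch below rebuilds the string from the question with a raw literal (an assumption about how to reproduce it, since the original encoding function is not shown) and counts the backslashes:

```python
# Same content as the question's t: nine literal \uXXXX sequences.
t = r"\u0048\u0065\u006c\u006c\u006f\u0020\u20ac\u0020\u00b0"

print(len(t))            # 54: nine sequences of six characters each
print(t.count("\\"))     # 9: one backslash per sequence
print(t.count("\\\\"))   # 0: no doubled backslashes anywhere

# With nothing to match, collapsing "double" backslashes is a no-op.
unchanged = t.replace("\\\\", "\\")
print(unchanged == t)    # True

# Removing the single backslashes just leaves the hex digits behind;
# it does not turn the sequences back into characters.
stripped = t.replace("\\", "")
print(stripped[:10])     # u0048u0065
```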

  • Tried that, it doesn't work. What we're dealing with here is the fact that `t='Hello \u20AC'` is interpreted as `\u20AC` being one character and converted to the euro. It is special handling. That is where the rub comes in. Solution is by @RocketDonkey. – narnie Jan 22 '13 at 07:18