3

I have this str (coming from a file I can't fix):

In [131]: s
Out[131]: '\\xce\\xb8Oph'

This is close to the repr of a string encoded in utf8:

In [132]: repr('θOph'.encode('utf8'))
Out[132]: "b'\\xce\\xb8Oph'"

I need to recover the original string. I can do it with

In [133]: eval("b'{}'".format(s)).decode('utf8')
Out[133]: 'θOph'

But I would be ... sad? if there were no simpler option to get it. Is there a better way?

matiasg

4 Answers

7

Your solution is OK; the only problem is that eval is dangerous when used with arbitrary input. The safe alternative is ast.literal_eval:

>>> s = '\\xce\\xb8Oph'
>>> from ast import literal_eval
>>> literal_eval("b'{}'".format(s)).decode('utf8')
'\u03b8Oph'

With eval you are subject to:

>>> eval("b'{}'".format("1' and print('rm -rf /') or b'u r owned")).decode('utf8')
rm -rf /
'u r owned'

Since ast.literal_eval is the opposite of repr for literals, I guess it is what you are looking for.
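As a minimal sketch of that difference, the snippet below wraps the call in a helper and feeds it the same payload that eval executed above. The helper name `decode_escaped` and the variable names are made up for illustration; the only library call is `ast.literal_eval`. It assumes the input contains no unescaped apostrophes.

```python
from ast import literal_eval


def decode_escaped(s):
    # Hypothetical helper (not from the original answer): treat s as the
    # body of a bytes literal, parse it safely, then decode it as UTF-8.
    return literal_eval("b'{}'".format(s)).decode('utf8')


decoded = decode_escaped('\\xce\\xb8Oph')
# ascii() keeps the output printable on any terminal encoding
print(ascii(decoded))

# The payload that eval() ran is refused here, because the resulting
# expression is not a plain literal.
payload = "1' and print('rm -rf /') or b'u r owned"
try:
    decode_escaped(payload)
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print('payload rejected:', rejected)
```

Nothing from the payload is executed; literal_eval only accepts literal syntax and raises otherwise.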

[update]

If you have a file with escaped unicode, you may want to open it with the unicode_escape encoding as suggested in the answer by Ginger++. I will keep my answer because the question was "how to convert repr into encoded string", not "how to decode file with escaped unicode".

Paulo Scardine
  • 3
    Bravo for literal_eval() :D And thank you for supporting eval() as a good way to do it. Still, don't forget about unescaped apostrophes and/or quotes. – Dalen Jul 15 '16 at 17:19
  • eval("b'%s'" % "'") will raise the SyntaxError. So this solution still isn't so straightforward as it seems at first glance. – Dalen Jul 15 '16 at 17:27
  • May be with a pinch of `s.replace("'", "\\'")` sprinkled here or there... – Paulo Scardine Jul 15 '16 at 17:49
  • Yep, s.replace() can indeed help with the trick. :D Whoops, this is the third function that will traverse the poor string: first replace(), then format() (which can be ignored), then literal_eval() (which probably calls eval() at the end of its checks). No wonder the OP finds it a bit ugly. :D Are we sure there is no more elegant solution than [literal_]eval() and my own? Perhaps some library that deals with encodings not covered in codecs? – Dalen Jul 15 '16 at 18:12
  • 1
    Actually, there is! See the answer by GingerPP! So there is an escape codec in the standard lib after all. Wow! Python constantly amazes me. literal_eval() is still a valid and good solution, although using codecs is really the most natural and logical one, now that we know the codec exists. – Dalen Jul 15 '16 at 18:22
4

Just open your file with unicode_escape encoding, like:

with open('name', encoding="unicode_escape") as f:
    pass # your code here

Original answer:

>>> '\\xce\\xb8Oph'.encode('utf-8').decode('unicode_escape')
'Î¸Oph'

You can get rid of that encoding to UTF-8, if you read your file in binary mode instead of text mode:

>>> b'\\xce\\xb8Oph'.decode('unicode_escape')
'Î¸Oph'
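Note that unicode_escape turns each \xNN escape into the single character U+00NN rather than into a byte, so a two-byte UTF-8 sequence comes out as two Latin-1 characters. A rough sketch of the full round trip follows; the variable names are illustrative, and it assumes the text holds only ASCII characters and \xNN escapes (a \u escape above U+00FF would make the latin1 step fail).

```python
raw = '\\xce\\xb8Oph'  # literal backslashes, as read from the file

# Step 1: interpret the escapes; each \xNN becomes code point U+00NN
step1 = raw.encode('ascii').decode('unicode_escape')
print([hex(ord(c)) for c in step1[:2]])

# Step 2: latin1 maps U+00NN back to byte 0xNN, giving the UTF-8 bytes
as_bytes = step1.encode('latin1')

# Step 3: decode those bytes as UTF-8 to get the intended text
fixed = as_bytes.decode('utf8')
print(ascii(fixed))  # ascii() keeps the output terminal-independent
```

The latin1 step works because Latin-1 is the one codec that maps code points 0 to 255 onto the same byte values.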
GingerPlusPlus
  • Aha! I knew that it lurked somewhere just beyond the reach! Excellent! – Dalen Jul 15 '16 at 18:19
  • 1
    This seems to be the way to go. But 'Î¸Oph' is not what I want; I want 'θOph'. I can get it with this: `s.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')`, which looks completely weird to me. – matiasg Jul 18 '16 at 14:13
  • @matiasg arguably it's because whoever wrote to the file isn't doing the right thing with their encoding, so you have to do their job on this end. Which is ridiculous, but you know, that's your life. – Wayne Werner Jul 18 '16 at 14:28
  • @matiasg: I thought it's an issue with my terminal, I now see it's not. I'm trying to figure out what is going on. Do you know how that file was created? – GingerPlusPlus Jul 18 '16 at 16:03
  • `unicode_escape` treats each escape sequence as a separate character, and that's why it's not working here (`'θ'` is represented by 2 bytes) – GingerPlusPlus Jul 18 '16 at 16:34
2

Unfortunately, this is really problematic. It's \ killing you softly here.

I can only think of:

s = '\\xce\\xb8Oph\\r\\nMore test\\t\\xc5\\xa1'
# Simple escapes, kept outside the loop so the dict is built only once
escapes = {"'": "'", '"': '"', "\\": "\\", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "0": "\0"}
n = ""
x = 0
while x < len(s):
    if s[x] == "\\":
        sx = s[x+1:x+4]
        marker = sx[0:1]
        if marker == "x":
            # \xNN: the two hex digits become one character with that byte value
            n += chr(int(sx[1:], 16)); x += 4
        elif marker in escapes:
            n += escapes[marker]; x += 2
        else: n += s[x]; x += 1
    else: n += s[x]; x += 1
print(repr(n), repr(s))
# On Python 3, n holds one character per byte, so go through latin1 to get the bytes back
print(repr(n.encode("latin1").decode("UTF-8")))

There might be some other trick to pull this off, but at the moment this is all I got.

Dalen
  • hmmm, thanks, but this is more involved than the solution I currently have. – matiasg Jul 15 '16 at 12:16
  • Your solution is good, and nothing is missing, but you have to refactor it a little: in its current state you will have problems when an apostrophe appears in the input string without being escaped by \\. No, my solution actually skips a lot of things that happen before a similar algorithm is applied inside eval(). So you are doing essentially the same thing. It's just a little bit longer. I don't see a problem here because you can turn it into a function and use it at your leisure. :D – Dalen Jul 15 '16 at 16:44
  • 1
    So, see, nothing is wrong with your approach. It may even be faster, as eval() is implemented in C while my code is pure Python. You asked for an alternative, and I provided one. If you are hoping to find a specific built-in function that deals with your problem, well, I think you are out of luck, because I am not aware that such a function exists. It may still be somewhere out there, but I seriously doubt it. eval() is meant for such stuff. There's nothing wrong with using it in this way; you can stop looking for alternatives. Just be cautious with it to prevent its abuse from outside of your app. – Dalen Jul 15 '16 at 16:55
  • 1
    Watch out, `eval` is a security liability. Use `ast.literal_eval` instead. – Paulo Scardine Jul 15 '16 at 17:14
0

To make a teeny improvement on GingerPlusPlus's answer:

import tempfile                                                        

with tempfile.TemporaryFile(mode='rb+') as f:                          
    f.write(r'\xce\xb8Oph'.encode())                                   
    f.flush()                                                          
    f.seek(0)                                                          

    print(f.read().decode('unicode_escape').encode('latin1').decode()) 

If you open the file in binary mode (i.e. rb for reading; I added + because I was also writing to the file), you can skip the first encode call. It's still awkward, because you have to bounce through the decode/encode hop, but at least you avoid that first encode call.

Wayne Werner