Python. Convert escaped utf string to utf-string

Question

Is there any built-in way to do this?

rawstr = r"3 \u176? \u177? 0.2\u176? (2\u952?)"
#required str is 3 ° ± 0.2° (2θ).

something like

In [1] rawstr.unescape()?
Out[1]: '3° ± 0.2° 2θ'

The question is how to convert rawstr to 'utf-8'.

Please see my answer for more clarity.

Please answer if there is a better option than what I am doing right now.

you could use `codecs.raw_unicode_escape_decode`. Unfortunately your raw string contains invalid unicode escapes, hence it does not work (I'm referring to `\u176?`. They should be in the form `\uXXXX`) — Bakuriu, Mar 02 '17 at 06:37
Alternatively, create a bytestring (use `rb` as prefix) and use `.decode('unicode-escape')`, but this again fails because `\u176?` is not a valid unicode escape. — Bakuriu, Mar 02 '17 at 06:39
Possible duplicate of [How to decode string representative of utf-8 with python?](http://stackoverflow.com/questions/39035899/how-to-decode-string-representative-of-utf-8-with-python) — tripleee, Mar 02 '17 at 07:58

math2001 · Answer 1 · 2017-03-02T07:04:25.200

2

Yep, there is!

For Python 2:

print r'your string'.decode('unicode_escape')  # 'string_escape' would leave \uXXXX untouched

For Python 3, you need to convert it to bytes first, and then use decode:

print(rb'your string'.decode('unicode_escape'))

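Note that this doesn't work in your case, since your symbols aren't escaped properly: Python expects exactly four hexadecimal digits after `\u`, whereas your string uses decimal code points followed by `?` (even if you put them in a normal, non-raw string literal, it doesn't work).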

Your string should be like this:

rb'3\u00B0 \u00b1 0.2\u00B0 2\u03B8'

Note that if you need to convert a string to bytes in Python 3, you can use the `bytes` function.

my_str = r'3\u00B0 \u00b1 0.2\u00B0 2\u03B8'
my_bytes = bytes(my_str, 'utf-8')  # Python 3 only; in Python 2, my_str is already a byte string
print(my_bytes.decode('unicode_escape'))  # Python 3, prints 3° ± 0.2° 2θ
# Python 2 equivalent: print my_str.decode('unicode_escape')
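
One caveat with the `bytes(my_str, 'utf-8')` step: `unicode_escape` reads every non-ASCII byte as Latin-1, so any real non-ASCII character already present in the string gets mangled. A minimal sketch of a workaround, assuming Python 3 (the helper name `decode_escapes` is made up for illustration):

```python
def decode_escapes(text):
    """Decode backslash-u escapes in a str (hypothetical helper)."""
    # latin-1 keeps code points below 256 as single bytes; anything higher is
    # rewritten as a backslash escape, so the bytes stay safe for unicode_escape.
    return text.encode('latin-1', 'backslashreplace').decode('unicode_escape')

print(decode_escapes(r'3\u00B0 \u00b1 0.2\u00B0 2\u03B8'))  # 3° ± 0.2° 2θ
print(decode_escapes('é \\u00B0 θ'))                         # é ° θ

# For comparison, the UTF-8 round trip corrupts the literal "é":
print(bytes('é', 'utf-8').decode('unicode_escape'))          # Ã©
```

This still requires valid `\uXXXX` escapes, so it does not help with the `\u176?` form from the question.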

edited Mar 02 '17 at 07:04

answered Mar 02 '17 at 06:46

math2001


I think it is ANSI text. – Rahul Mar 02 '17 at 07:03
"ANSI text" is not a well-defined term. On Windows, it was misleadingly used in the past to refer to the system's local default encoding, which was widely further misinterpreted to be a particular code page (commonly 1252, though you see all of 437, 850, and whatever is the default in the reader's locale). – tripleee Mar 02 '17 at 07:55

score 1 · Accepted Answer · answered Mar 02 '17 at 11:49

If you are on Windows and have pythonnet installed:

import clr
clr.AddReference("System")
clr.AddReference("System.Windows.Forms")
import System.Windows.Forms as WinForms

def rtf_to_text(rtf_str):
    """Converts rtf to text"""

    rtf = r"{\rtf1\ansi\ansicpg1252" + '\n' + rtf_str + '\n' + '}'
    richTextBox = WinForms.RichTextBox()
    richTextBox.Rtf = rtf
    return richTextBox.Text

print(rtf_to_text(r'3 \u176? \u177? 0.2\u176? (2\u952?)'))
# --> '3 ° ± 0.2° (2θ)'
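
For anyone without Windows or pythonnet, the same conversion can probably be sketched with the standard library alone. This is not a full RTF parser, only a rough equivalent for strings like the one above. It assumes every escape is a decimal `\uN` followed by a single `?` fallback character, and it does not recombine surrogate pairs. The name `rtf_unicode_to_text` is made up for illustration:

```python
import re

# Backslash, "u", a signed decimal number, then one optional "?" fallback char.
_RTF_UNICODE = re.compile(r'\\u(-?\d+)\??')

def rtf_unicode_to_text(s):
    """Replace RTF-style \\uN? escapes with the characters they name (sketch)."""
    def _replace(match):
        code = int(match.group(1))
        if code < 0:
            code += 65536  # RTF writes \uN as a signed 16-bit integer
        return chr(code)
    return _RTF_UNICODE.sub(_replace, s)

print(rtf_unicode_to_text(r'3 \u176? \u177? 0.2\u176? (2\u952?)'))
# --> 3 ° ± 0.2° (2θ)
```

Unlike the `RichTextBox` approach, this leaves every other RTF control word in the input untouched, so it only suits text whose sole markup is these escapes.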
