How to convert a unicode string to a literal string in Python?

Question

Here are a few example (unicode) strings:

a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
b = u'\u010deprav so mu doma\u010di in strici duhovniki odtegovali denarno pomo\u010d . Kljub temu mu je uspelo'
c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'

My end goal is to split on the backslashes (and spaces), so that it looks like this:

split_a = ['u03c3', 'u03c4', 'u03b7', 'u03bd', '', 'u03a0', 'u03bb', 'u03b1', 'u03c4', 'u03b5', 'u03af', 'u03b1', '', 'u03c4', 'u03bf', 'u03c5']
split_b = ['', 'u010deprav', 'so', 'mu', 'doma', 'u010di', 'in', 'strici', 'duhovniki', 'odtegovali', 'denarno', 'pomo', 'u010d', '.', 'Kljub', 'temu', 'mu', 'je', 'uspelo']
split_c = ['sovi', 'xe9ticas', 'excepto', 'Georgia', ',', 'inclusive', 'las', '3', 'rep', 'xfablicas', 'que', 'hab', 'xedan']

(The empty places where there is both a space and a backslash are totally fine).

When I try to split using this:

a.split("\\"), it doesn't change the string at all.

I saw this example here, which makes me think that I need to make my strings literal strings (using r). However, I don't know how to convert my large list of strings into all literal strings.

When I searched on that, I got here. However, my compiler throws an error when I run a.encode('latin-1').decode('utf-8'). The error it throws is 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

So, my question is: How can I take a list of unicode strings, programmatically iterate through them and make them string literals, and then split on a backslash?

Python is an interpreted language, so the Python interpreter throws the error. — linusg, May 10 '16 at 16:01
I think you're a bit above my level here, but thanks for the info! — python_in_trouble, May 10 '16 at 16:05

score 3 · Answer 1 · answered May 10 '16 at 16:03

You have a Unicode string, which already holds one Unicode code point per string element. The backslash escapes are only how the string is displayed when its representation is printed to the console; they are not part of the actual contents, so there is no backslash in the string to split on.
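
A quick way to check this is to inspect the string directly. The following is a minimal sketch using a shortened version of the asker's string `a`; the names `length`, `has_backslash`, and `first_codepoint` are only illustrative.

```python
# Shortened version of the asker's string `a`: four Greek letters.
a = u'\u03c3\u03c4\u03b7\u03bd'

# Each \uXXXX escape in the source is a single character in memory.
length = len(a)

# No backslash is stored in the string, so splitting on it finds nothing.
has_backslash = u'\\' in a

# ord() returns the numeric code point of one character.
first_codepoint = ord(a[0])
```

Here `length` is 4 rather than 24, and `has_backslash` is `False`, which is why `a.split("\\")` leaves the string unchanged.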

To make a list of numbers out of it is actually quite easy:

split_a = [ord(c) for c in a]

If you need to make a bunch of strings consisting of the letter u followed by the hex value, that's only slightly more complicated:

split_a = ', '.join('u' + ('%04x' % ord(c)) for c in a)
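
Note that the `join` call above yields one comma-separated string rather than a list. If a list is wanted, and ASCII characters such as the space should stay readable (as suggested in the comments), something like the following sketch might work. The helper name `escape_non_ascii` is hypothetical, and unlike the `unicode_escape` codec it always emits the `uXXXX` form, so `\xe9` becomes `u00e9` rather than `xe9`.

```python
def escape_non_ascii(text):
    """Return one list entry per character.

    ASCII characters are kept as they are; every other character
    becomes 'u' followed by its code point as at least four hex digits.
    """
    return [c if ord(c) < 128 else 'u%04x' % ord(c) for c in text]


a = u'\u03bd \u03a0'
tokens = escape_non_ascii(a)
```

For this input `tokens` holds `'u03bd'`, a single space, and `'u03a0'`.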

answered May 10 '16 at 16:03

Mark Ransom

The second one solved my problem for my example above. I've edited my question to include some more sample unicode strings, let me know if you have a solution for those other types of strings. – python_in_trouble May 10 '16 at 16:10
Was just about to push submit on a similar solution, so I'll just add a follow up comment - you'd have to do a bit more work to only display the values for characters that are unknown encodings. Specifically, in the OP's example, rendering the space character as " ", vs. "u0020". – Christian May 10 '16 at 16:10
@python_in_trouble wow, that's a completely different problem now, much more complex. – Mark Ransom May 10 '16 at 16:32

score 1 · Accepted Answer · answered May 10 '16 at 16:08

You can use the unicode_escape codec to translate a unicode string to its escaped representation.

split_a = a.encode('unicode_escape').split('\\')

outputs:

['',
 'u03c3',
 'u03c4',
 'u03b7',
 'u03bd ',
 'u03a0',
 'u03bb',
 'u03b1',
 'u03c4',
 'u03b5',
 'u03af',
 'u03b1 ',
 'u03c4',
 'u03bf',
 'u03c5']
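
The call above relies on Python 2, where `encode` returns a `str`. In Python 3 it returns `bytes`, so splitting on the text string `'\\'` raises a `TypeError`. A minimal sketch that should behave the same in both versions decodes the escaped bytes back to text first (the name `escaped` is only illustrative):

```python
# Shortened version of the asker's string `a`.
a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb'

# unicode_escape produces ASCII-only bytes; decode them back to text
# so that the result can be split with an ordinary string separator.
escaped = a.encode('unicode_escape').decode('ascii')

# The leading backslash yields an empty first element.
split_a = escaped.split('\\')
```

As in the output shown above, the space stays attached to the preceding element (`'u03bd '`).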

answered May 10 '16 at 16:08

Daniel Roseman

This worked for me if I then iterated through the `split_a` list and further `split` on " " (space). – python_in_trouble May 10 '16 at 16:34
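
The two-step split described in this comment could also be done in one pass with a regular expression. This is only a sketch: `split_escaped` is a hypothetical helper, and it assumes the input contains no literal backslashes, tabs, or newlines, since the `unicode_escape` codec escapes those as well.

```python
import re


def split_escaped(text):
    """Escape non-ASCII characters, then split on backslashes and whitespace."""
    escaped = text.encode('unicode_escape').decode('ascii')
    # The character class matches one backslash or one whitespace character.
    return re.split(r'[\\\s]', escaped)


c = u'sovi\xe9ticas excepto Georgia'
split_c = split_escaped(c)
```

Where a space is directly followed by an escaped character, the result contains an empty string, which the question says is acceptable.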
