Here are a few example (unicode) strings:

a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'
b = u'\u010deprav so mu doma\u010di in strici duhovniki odtegovali denarno pomo\u010d . Kljub temu mu je uspelo'
c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'

My end goal is to split on the backslashes (and spaces), so that it looks like this:

split_a = ['u03c3', 'u03c4', 'u03b7', 'u03bd', '', 'u03a0', 'u03bb', 'u03b1', 'u03c4', 'u03b5', 'u03af', 'u03b1', '', 'u03c4', 'u03bf', 'u03c5']
split_b = ['', 'u010deprav', 'so', 'mu', 'doma', 'u010di', 'in', 'strici', 'duhovniki', 'odtegovali', 'denarno', 'pomo', 'u010d', '.', 'Kljub', 'temu', 'mu', 'je', 'uspelo']
split_c = ['sovi', 'xe9ticas', 'excepto', 'Georgia', ',', 'inclusive', 'las', '3', 'rep', 'xfablicas', 'que', 'hab', 'xedan']

(The empty places where there is both a space and a backslash are totally fine).

When I try to split using this:

a.split("\\"), it doesn't change the string at all.

I saw this example here, which makes me think that I need to turn my strings into raw strings (using the r prefix). However, I don't know how to convert my large list of strings into raw strings.

When I searched on that, I got here. However, Python raises an error when I run a.encode('latin-1').decode('utf-8'). The error is: 'latin-1' codec can't encode characters in position 0-3: ordinal not in range(256)

So, my question is: How can I take a list of unicode strings, programmatically iterate through them and make them string literals, and then split on a backslash?

2 Answers

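You have a Unicode string, which already holds one Unicode code point per string element. The backslash escapes (\u03c3, \xe9 and so on) are only how the string is written in source code or printed to the console as its representation; they are not part of the actual contents, so there is no backslash to split on.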
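
A quick way to check this, as a minimal sketch using the first sample string from the question (the variable names `length`, `has_backslash` and `unsplit` are only illustrative):

```python
a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'

# Each \uXXXX escape in the literal became one code point,
# so the string is 14 Greek letters plus 2 spaces.
length = len(a)

# No element of the string is a backslash ...
has_backslash = u'\\' in a

# ... so splitting on a backslash returns the whole string as a single item.
unsplit = a.split(u'\\')
```

Here `length` is 16, `has_backslash` is False, and `unsplit` is a one-element list, which is why the split appears to do nothing.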

Making a list of code point numbers out of it is actually quite easy:

split_a = [ord(c) for c in a]

If you need to make a bunch of strings consisting of the letter u followed by the hex value, that's only slightly more complicated:

split_a = ', '.join('u' + ('%04x' % ord(c)) for c in a)
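
Note that the line above produces one comma-joined string. If a list is wanted instead, and plain ASCII characters such as the space should stay as they are, something along these lines might work. This is only a sketch: `escape_non_ascii` is a hypothetical helper name, and characters above U+FFFF would come out with more than four hex digits.

```python
a = u'\u03c3\u03c4\u03b7\u03bd \u03a0\u03bb\u03b1\u03c4\u03b5\u03af\u03b1 \u03c4\u03bf\u03c5'

def escape_non_ascii(ch):
    """Hypothetical helper: 'uXXXX' for a non-ASCII character, the character itself otherwise."""
    return ch if ord(ch) < 128 else u'u%04x' % ord(ch)

# One list entry per character; spaces are kept as ' ' rather than 'u0020'.
pieces = [escape_non_ascii(ch) for ch in a]
```

Unlike the format wanted in the question, this always uses four hex digits, so \xe9 becomes 'u00e9' rather than 'xe9'.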
Mark Ransom
  • The second one solved my problem for my example above. I've edited my question to include some more sample unicode strings, let me know if you have a solution for those other types of strings. – python_in_trouble May 10 '16 at 16:10
  • Was just about to push submit on a similar solution, so I'll just add a follow up comment - you'd have to do a bit more work to only display the values for characters that are unknown encodings. Specifically, in the OP's example, rendering the space character as " ", vs. "u0020". – Christian May 10 '16 at 16:10
  • @python_in_trouble wow, that's a completely different problem now, much more complex. – Mark Ransom May 10 '16 at 16:32

You can use the unicode_escape codec to translate a Unicode string to its escaped representation.

split_a = a.encode('unicode_escape').split('\\')

outputs:

['',
 'u03c3',
 'u03c4',
 'u03b7',
 'u03bd ',
 'u03a0',
 'u03bb',
 'u03b1',
 'u03c4',
 'u03b5',
 'u03af',
 'u03b1 ',
 'u03c4',
 'u03bf',
 'u03c5']
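
To also split on spaces, as the question asks, one option is to split the escaped text with a regular expression. The following is a sketch rather than a definitive solution: `split_escaped` is a hypothetical helper name, and the extra `.decode('ascii')` is there so the same code also runs on Python 3, where `encode` returns bytes.

```python
import re

def split_escaped(text):
    """Hypothetical helper: escape non-ASCII characters, then split on backslashes and spaces."""
    # unicode_escape output is pure ASCII, so decoding it back to text is safe.
    escaped = text.encode('unicode_escape').decode('ascii')
    # The character class matches a single backslash or a single space.
    return re.split(r'[\\ ]', escaped)

c = u'sovi\xe9ticas excepto Georgia , inclusive las 3 rep\xfablicas que hab\xedan'
split_c = split_escaped(c)
```

For `c` this gives the list shown in the question. For a string like `a`, an empty string appears wherever a space is directly followed by an escape. One caveat: unicode_escape also escapes literal backslashes, newlines and tabs in the input, so text containing those would be split in additional places.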
Daniel Roseman