Django extract string from unicode encoding

Question

I have the following string from which I want to extract "gcc-4.3.2" and "C"

u"u'gcc-4.3.2' u'C'"

I tried smart_str() and the output is the following

"u'gcc-4.3.2' u'C'"

Now I did split(" ")

tokens = ["u'gcc-4.3.2'", "u'C'"]

Then I tried

smart_str(tokens[0]), but it gives me the same thing

"u'gcc-4.3.2'"

How do I extract gcc-4.3.2 from it?

(I want to do this for other values as well, so I don't want to hardcode it.)

Any help would be appreciated,

Thanks,

Pankaj.

How did you get that doubly-quoted string in the first place? Sounds like you should fix that, first. — Daniel Roseman, Apr 08 '12 at 13:19
Yeah, you appear to be getting ``repr(x)`` when you want ``str(x)``. — Gareth Latty, Apr 08 '12 at 13:20

score 2 · Accepted Answer · edited May 23 '17 at 12:04

Your real issue here seems to be that you are getting the representation (repr) of a value rather than the value itself.

>>> x = u"gcc-4.3.2"
>>> x
u'gcc-4.3.2'
>>> repr(x)
"u'gcc-4.3.2'"
>>> str(x)
'gcc-4.3.2'

If you have any control over the place you are getting the value from, I would go there first and deal with that.
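As a sketch of what such a fix might look like: one common way to end up with a string like this is to format the values with repr() instead of joining the values themselves. The variable names below are hypothetical, since the question does not show how the string was built.

```python
# Hypothetical source data; the question does not show where the values come from.
values = [u"gcc-4.3.2", u"C"]

# Joining the representations embeds the quotes (and, on Python 2, the u prefix) in the data.
quoted = u" ".join(repr(v) for v in values)

# Joining the values themselves keeps the data clean.
clean = u" ".join(values)

print(quoted)
print(clean)
```

With the clean form, a plain split(" ") gives back the original values and no unquoting step is needed.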

Warning: unicode strings are a separate type for a reason. If the value contains non-ASCII characters, str() can fail:

>>> x = u"ĝĝ"
>>> x
u'\u011d\u011d'
>>> repr(x)
"u'\\u011d\\u011d'"
>>> str(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
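
If a byte string is what you actually need, encoding explicitly avoids the implicit ASCII codec. A minimal sketch, assuming UTF-8 is the encoding you want:

```python
x = u"\u011d\u011d"  # the same two characters as above, written as escapes

# An explicit codec is used here, so the ASCII default never comes into play.
encoded = x.encode("utf-8")

# Decoding with the same codec round-trips to the original text.
assert encoded.decode("utf-8") == x
```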

If you have no control over the data you are getting, the value of repr(x) is an expression you can evaluate:

>>> x = "u'gcc-4.3.2'"
>>> eval(x)
u'gcc-4.3.2'

However, do note that eval executes any Python expression it is given, so it is unsafe on input you do not fully trust.
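
For a single literal, the standard library's ast.literal_eval is a commonly suggested alternative: it accepts only Python literals (strings, numbers, tuples, lists, dicts and so on) rather than arbitrary expressions. A minimal sketch:

```python
import ast

x = "u'gcc-4.3.2'"

# literal_eval parses literals only; anything else raises an exception.
value = ast.literal_eval(x)

# A function call is not a literal, so it is rejected instead of being run.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True

print(value)
print(rejected)
```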

If you want to deal with extracting the unicode strings more safely, you could do something like this:

>>> import re
>>> x = "u'gcc-4.3.2' u'C'"
>>> re.findall("u'(.*?)'", x)
['gcc-4.3.2', 'C']

Here we use a regular expression to extract anything in the string encased in u''. We use .*? to make the operation non-greedy, ensuring we don't end up with ["gcc-4.3.2' u'C"] as our output.
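
One limitation of the pattern above is that it returns the raw text between the quotes, so escapes such as \u011d stay as literal backslash sequences and an escaped quote ends the match early. A possible variant, sketched here with a hypothetical helper name extract_literals, matches each whole u'...' or u"..." literal and hands it to ast.literal_eval to decode:

```python
import ast
import re

# Matches a u-prefixed string literal in single or double quotes,
# allowing backslash escapes (including escaped quotes) inside it.
LITERAL_RE = re.compile(
    r"\bu'(?:[^'\\]|\\.)*'"
    r'|\bu"(?:[^"\\]|\\.)*"'
)


def extract_literals(text):
    """Return the decoded value of every u'...' or u"..." literal in text."""
    # No capturing groups are used, so findall returns each full literal.
    return [ast.literal_eval(literal) for literal in LITERAL_RE.findall(text)]


print(extract_literals("u'gcc-4.3.2' u'C'"))
```

This still assumes the input is a sequence of well-formed literals; text that merely happens to contain u' elsewhere would need a stricter pattern.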
