Cannot split a unicode string without converting to ascii - python 2.7

Question

I want to split the string 'I have £300.', but it seems that the split function first converts it to ASCII, and afterwards I can't convert it back to the same unicode it was before.

Is there any other way to split such a unicode string without breaking it, as in the snippet below?

# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()

#I want to get [I,have,£300]
#I get: ['I', 'have', '\xc2\xa3300.']

Strings are ASCII in Python 2. – juanpa.arrivillaga Aug 29 '16 at 23:03
Ok, but how do i split in the way I want? – Brana Aug 29 '16 at 23:04

score 3 · Answer 1 · answered Aug 29 '16 at 23:04

You are looking at a limitation of the way python 2 displays data.

Using python 2:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']

But observe that an individual item prints as you want:

>>> print(mystring.split()[2])
£300.
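The escapes appear because displaying a list shows the repr() of each item, and repr() escapes non-ASCII bytes; the data itself is unchanged by split(). As a minimal sketch of one way to get readable output for the whole list, the items can be decoded and joined before display. The helper name format_items is hypothetical, and the bytes are assumed to be UTF-8, as in the question:

```python
# -*- coding: utf-8 -*-
# Sketch only: format_items is a hypothetical helper, not part of the original code.

def format_items(items, encoding="utf-8"):
    """Return the items joined as text instead of relying on the list's repr()."""
    return u", ".join(item.decode(encoding) for item in items)

raw = b'I have \xc2\xa3300.'   # the UTF-8 bytes behind the question's str literal
parts = raw.split()            # splitting leaves the bytes untouched

line = format_items(parts)     # text with a real pound sign, ready to print
```

Printing line on a UTF-8 terminal should then show the pound sign rather than the escaped bytes.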

Using python 3, by contrast, it displays as you would like:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']

A major reason to use python 3 is its superior handling of unicode.

John1024

Any workaround - as I have python 2.6.6 on my server? – Brana Aug 29 '16 at 23:07
@Brana If you `print` the string itself, as opposed to a list which contains it, then it will display as you want. – John1024 Aug 29 '16 at 23:11
I got that, but there is a problem with other things such as processing the string. Is it maybe possible to set the default encoding to utf-8? – Brana Aug 29 '16 at 23:23
_"there is a problem with other things such as processing the string"_ Can you be specific about that? – John1024 Aug 29 '16 at 23:38
Such as replacing specific words, or determining if 2 words are equal. This can sometimes be a problem, as every split function makes unicode chars ascii... It will probably be best to create my own function for splitting that decodes every element as utf-8, and always use it instead of the default split. – Brana Aug 29 '16 at 23:45
Or install python 3 as I do not think that any kind of linux comes with python 3. – Brana Aug 29 '16 at 23:46
@Brana As an aside, figuring out if two unicode strings are "equal" is a non-trivial problem by itself. – roeland Aug 29 '16 at 23:48
@Brana Have you tried running `python3`? Every linux that I use has python3 and has had it for many years. Python2 is still the default, `/usr/bin/python`, but version 3 is available under the name `python3`. – John1024 Aug 29 '16 at 23:50
@roeland - are you sure? I saw that unicode is trying to convert 2 string arguments to the same encoding (utf-8) in order to compare them, but if they can be converted it should be ok. If one is not utf-8 then it cannot compare them but thats ok by me. – Brana Aug 30 '16 at 02:19
@John1024 I tried /usr/bin/python3 and it doesn't work. I only use exec via php – Brana Aug 30 '16 at 02:20
# -*- coding: utf-8 -*- a1= u'Duda' a2= "Duda" print a1, a2 print a1 == a2 print "da" in a2 – Brana Aug 30 '16 at 02:22
I use this exec command - $mcd = 'python /home/domain/public_html/test.py'; exec($mcd.' "' . addslashes($thevariable) . '" ', $out, $r); – Brana Aug 30 '16 at 02:25
when i just change python to python3 it doesn't work :) – Brana Aug 30 '16 at 02:26
@Brana **tchrist** wrote one of the best posts ever about Unicode on SO. It was about a Perl question, but most of the answer applies to any program using Unicode. → [ ](http://stackoverflow.com/a/6163129/4447998) (scroll down to the laundry list, and to " ") – roeland Aug 30 '16 at 03:57
@Brana and trying to process text stored in an 8-bit string (which a python 2 `str` is) will just get you trouble. First decode to `unicode`, then do your processing. – roeland Aug 30 '16 at 04:01
Yes, I now first use text.decode("utf-8", "ignore").encode("utf-8") and most non-ASCII chars didn't give me errors. – Brana Aug 30 '16 at 05:55

score 1 · Answer 2 · answered Aug 29 '16 at 23:24

The problem is not with split(). The real problem is that the handling of unicode in python 2 is confusing.

The first line in your code produces a string, i.e. a sequence of bytes, which contains the utf-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:

>>> mystring
'I have \xc2\xa3300.'

The rest of the statements just do what you would expect them to with such input. If you want to work with unicode, create a unicode string to start with:

>>> mystring = u'I have £300.'

A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of unicode in python 2 is not worth the effort when there's such a superior alternative.
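For code that has to stay on Python 2, the idea of working with unicode from the start can be sketched as follows: decode incoming bytes once, do the splitting, comparing and replacing on text, and encode only when writing output. This is an illustrative sketch, not the asker's code. The helper names split_text and replace_word are hypothetical, and the input is assumed to be UTF-8.

```python
# -*- coding: utf-8 -*-
# Sketch only: split_text and replace_word are hypothetical helper names.

def split_text(raw, encoding="utf-8"):
    """Decode a byte string once, then split the resulting text."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return text.split()

def replace_word(raw, old, new, encoding="utf-8"):
    """Decode, replace on text, and encode back to bytes for output."""
    text = raw.decode(encoding)
    return text.replace(old, new).encode(encoding)

raw = b'I have \xc2\xa3300.'   # UTF-8 bytes, e.g. read from a file or argv
words = split_text(raw)        # a list of text items, one character per code point

# Comparisons now work on characters rather than on encoded bytes.
has_amount = words[2] == u'\xa3300.'
in_dollars = replace_word(raw, u'\xa3', u'$')
```

After decoding, the pound sign counts as a single character, so len(words[2]) is 5, whereas the encoded byte string is 6 bytes long.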
