1

I want to split the string 'I have £300.', but it seems that the split function first converts it to ASCII, and afterwards I can't convert it back to Unicode the way it was before.

Is there any other way to split such a Unicode string without breaking it, as happens in the snippet below?

# -*- coding: utf-8 -*-
mystring = 'I have £300.'
alist = mystring.split()
alist = [item.decode("utf-8") for item in alist]
print "alist",alist
print "mystring.split()",mystring.split()

#I want to get [I,have,£300]
#I get: ['I', 'have', '\xc2\xa3300.']
Cyrus Mohammadian
Brana

2 Answers

3

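You are looking at a limitation of the way Python 2 displays data: when a list is displayed, each item is shown as its `repr`, so the non-ASCII bytes appear as escape sequences.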

Using python 2:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '\xc2\xa3300.']

But, observe that it will print as you want:

>>> print(mystring.split()[2])
£300.

Using python 3, by contrast, it displays as you would like:

>>> mystring = 'I have £300.'
>>> mystring.split()
['I', 'have', '£300.']

A major reason to use python 3 is its superior handling of unicode.
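
The display difference above comes from the fact that displaying a list shows the `repr` of each item, while printing a single item shows the text itself. A minimal sketch in Python 3, assuming that the `bytes` type is an acceptable stand-in for the Python 2 `str` (the variable names are only illustrative):

```python
# In Python 3, bytes stands in for the Python 2 str type.
# '\u00a3' is the pound sign, written as an escape to avoid source-encoding issues.
raw = 'I have \u00a3300.'.encode('utf-8')

parts = raw.split()                    # splitting leaves the UTF-8 data untouched
shown_in_list = repr(parts)            # what displaying the whole list shows
last_word = parts[2].decode('utf-8')   # the actual text of the third item

print(shown_in_list)
print(last_word)
```

The escapes appear only in the list's representation; the underlying data is the same in both cases.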

John1024
  • Any workaround - as I have python 2.6.6 on my server? – Brana Aug 29 '16 at 23:07
  • @Brana If you `print` the string itself, as opposed to a list which contains it, then it will display as you want. – John1024 Aug 29 '16 at 23:11
  • I got that, but there is a problem with other things such as processing the string. Is it maybe possible to set the default encoding to UTF-8? – Brana Aug 29 '16 at 23:23
  • _"there is a problem with other things such as processing the string"_ Can you be specific about that? – John1024 Aug 29 '16 at 23:38
  • Such as replacing specific words, or determining whether two words are equal. This can sometimes be a problem, as every split function makes Unicode chars ASCII... It will probably be best to create my own function for splitting that decodes every element as UTF-8, and always use it instead of the default split. – Brana Aug 29 '16 at 23:45
  • Or install python 3 as I do not think that any kind of linux comes with python 3. – Brana Aug 29 '16 at 23:46
  • 1
    @Brana As an aside, figuring out if two unicode strings are "equal" is a non-trivial problem by itself. – roeland Aug 29 '16 at 23:48
  • @Brana Have you tried running `python3`? Every linux that I use has python3 and has had it for many years. Python2 is still the default, `/usr/bin/python`, but version 3 is available under the name `python3`. – John1024 Aug 29 '16 at 23:50
  • @roeland - are you sure? I saw that unicode tries to convert the two string arguments to the same encoding (UTF-8) in order to compare them, so if they can be converted it should be OK. If one is not UTF-8 then it cannot compare them, but that's OK by me. – Brana Aug 30 '16 at 02:19
  • @John1024 I tried /usr/bin/python3 and it doesn't work. I only use exec via php – Brana Aug 30 '16 at 02:20
  • # -*- coding: utf-8 -*- a1= u'Duda' a2= "Duda" print a1, a2 print a1 == a2 print "da" in a2 – Brana Aug 30 '16 at 02:22
  • I use this exec command - $mcd = 'python /home/domain/public_html/test.py'; exec($mcd.' "' . addslashes($thevariable) . '" ', $out, $r); – Brana Aug 30 '16 at 02:25
  • when i just change python to python3 it doesn't work :) – Brana Aug 30 '16 at 02:26
  • 1
    @Brana **tchrist** wrote one of the best posts about Unicode ever on SO. It was about a Perl question, but most of the answer applies to any program using Unicode. → [ ](http://stackoverflow.com/a/6163129/4447998) (scroll down to the laundry list, and to " ") – roeland Aug 30 '16 at 03:57
  • @Brana and trying to process text stored in an 8-bit string (which a python 2 `str` is) will just get you trouble. First decode to `unicode`, then do your processing. – roeland Aug 30 '16 at 04:01
  • Yes, I now first use `text.decode("utf-8", "ignore").encode("utf-8")` and most non-ASCII chars no longer give me errors. – Brana Aug 30 '16 at 05:55
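
On roeland's point that equality of Unicode strings is non-trivial: the same visible text can be stored as different code point sequences, so a plain `==` may report a difference. A minimal sketch, assuming that normalizing both sides to NFC is an acceptable notion of "equal" for the task (`same_text` is a hypothetical helper, not part of any library):

```python
import unicodedata

composed = u'caf\u00e9'      # 'é' stored as a single code point
decomposed = u'cafe\u0301'   # 'e' followed by a combining acute accent

# A plain comparison sees two different code point sequences.
naive_equal = composed == decomposed

def same_text(a, b):
    # Hypothetical helper: compare after normalizing both sides to NFC.
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

normalized_equal = same_text(composed, decomposed)
print(naive_equal, normalized_equal)
```

Normalization only works on decoded text, which is another reason to decode before processing.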
1

The problem is not with split(). The real problem is that the handling of unicode in python 2 is confusing.

The first line in your code produces a string, i.e. a sequence of bytes, which contains the utf-8 encoding of the symbol £. You can confirm this by displaying the repr of your original string:

>>> mystring
'I have \xc2\xa3300.'

The rest of the statements just do what you would expect them to with such input. If you want to work with unicode, create a unicode string to start with:

>>> mystring = u'I have £300.'
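
When the data arrives as bytes rather than as a literal, the same idea applies: decode once, then split and compare. The following is only a sketch with made-up variable names; it assumes the input is valid UTF-8 and is written so that it should behave the same on Python 2.6+ and Python 3.3+:

```python
# Bytes as they might arrive from a file or the command line (UTF-8 encoded).
raw = b'I have \xc2\xa3300.'

# Decode once at the boundary, then work with text only.
text = raw.decode('utf-8')
words = text.split()

# Comparisons and substring tests now operate on characters, not bytes.
has_pound = u'\xa3' in words[2]
same_word = words[2] == u'\xa3300.'

# Encode again only when bytes are needed for output.
encoded = [w.encode('utf-8') for w in words]
```

Keeping the decode and encode steps at the edges of the program means `split`, `replace`, and `==` all see the same kind of string.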

A far better solution, however, is to switch to Python 3. Wrapping your head around the semantics of unicode in python 2 is not worth the effort when there's such a superior alternative.

alexis