Replace special characters in python

Question

I have some text coming from the web as such:

Â£6.49

Obviously I would like this to be displayed as:

£6.49

I have tried the following so far:

s = url['title']
s = s.encode('utf8')
s = s.replace(u'Â','')

And a few variants on this (after finding it on this very same forum)

But still no luck as I keep getting:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 100: ordinal not in range(128)

Could anyone help me get this right?

UPDATE:

Adding the repr examples and content type

u'Star Trek XI &#xA3;3.99'
u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'
Content-Type: text/html; charset=utf-8

Thanks in advance.

Please post the `repr(...)` of the string from the web. Then we'll know for sure what we are dealing with. — unutbu, Jan 16 '11 at 14:19
And why it is bad in XML? Are you trying to work around inconsistent input? Yep, and repr(url['title']) would probably help. — ondra, Jan 16 '11 at 14:23
It might also help to post the `Content-Type` header: `response=urllib2.urlopen(url);content_type=response.headers.getheader('Content-Type')` — unutbu, Jan 16 '11 at 14:24

score 7 · Accepted Answer · edited May 23 '17 at 12:01

If s=url['title'] makes s equal to this:

In [48]: s=u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'

Then the problem is either

1. in the code that defines url, or
2. the content from the web is malformed.

If Case 1, we'd need to see the code that defines url.

If Case 2, a quick-and-dirty workaround would be to encode the unicode object s with the raw-unicode-escape codec:

In [49]: print(s)
Oscar Winners Best Pictures Box Set Â£6.49

In [50]: print(s.encode('raw-unicode-escape'))
Oscar Winners Best Pictures Box Set £6.49

See also this SO question.
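
On Python 3 there is no separate unicode type as in the session above, so the same workaround is usually written as an encode/decode round trip. A minimal sketch, assuming the title was UTF-8 that got decoded as Latin-1 somewhere upstream; fix_mojibake is a hypothetical helper name, not something from the question:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1."""
    try:
        # Latin-1 maps code points 0-255 straight back to bytes 0-255.
        raw = text.encode('latin-1')
        # Those bytes were UTF-8 all along, so decode them properly.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage; return the text unchanged.
        return text


broken_title = 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'
fixed_title = fix_mojibake(broken_title)
print(fixed_title)
```

Text that is already correct fails the round trip and is returned untouched, so the helper can be applied to a mix of good and damaged titles.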

Regarding titles like s=u'Star Trek XI &#xA3;3.99': Again, it would be nice to fix the problem before it gets to this stage, perhaps by looking at how url is defined. But assuming the content from the web is malformed, a workaround would be:

In [86]: import re

In [87]: print(re.sub(r'&#x([a-fA-F\d]+);',lambda m: unichr(int(m.group(1),base=16)),s))
Star Trek XI £3.99
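
For readers on Python 3, where unichr no longer exists, the same substitution can be sketched with chr. The standard library's html.unescape also handles these references (and named ones such as &pound;). decode_hex_refs is a hypothetical helper name used only for illustration:

```python
import html
import re


def decode_hex_refs(text):
    """Replace hexadecimal references such as &#xA3; with their characters."""
    return re.sub(r'&#x([0-9a-fA-F]+);',
                  lambda m: chr(int(m.group(1), 16)),
                  text)


entity_title = 'Star Trek XI &#xA3;3.99'
via_regex = decode_hex_refs(entity_title)
via_stdlib = html.unescape(entity_title)
print(via_regex)
print(via_stdlib)
```

The regex only touches hexadecimal references, which matches the workaround above; html.unescape is the broader choice if the titles may contain other entities.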

A little bit of explanation:

Note that

In [51]: x=u'£'
In [53]: x.encode('utf-8')
Out[53]: '\xc2\xa3'

So the unicode object u'£', encoded with the utf-8 codec, becomes the string object '\xc2\xa3'.

Somehow, url['title'] is getting defined to be the unicode object u'\xc2\xa3'. (The u makes a big difference!)

Thus we have u'\xc2\xa3' when we desire '\xc2\xa3'. Encoding the unicode object u'\xc2\xa3' with the raw-unicode-escape codec transforms it to '\xc2\xa3'.
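
The same mechanics can be reproduced on Python 3, where the encoded form is a bytes object. This is only an illustration; it assumes the wrong decoding step used Latin-1, which the question does not confirm:

```python
pound = '\u00a3'                        # the pound sign as one character

utf8_bytes = pound.encode('utf-8')      # two bytes: 0xC2 0xA3
wrong = utf8_bytes.decode('latin-1')    # each byte becomes its own character
right = utf8_bytes.decode('utf-8')      # the two bytes become one character

print(utf8_bytes)
print(len(wrong), len(right))
```

The wrongly decoded value has two characters (U+00C2 followed by U+00A3), which is exactly the stray character in front of the pound sign in the question.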

Hi, we're almost there. Your code did the trick for the second string, but the first one still displays wrong (Star Trek XI &#xA3;3.99) — Marcos Placona, Jan 16 '11 at 15:03

score 0 · Answer 2 · ondra · 2011-01-16T14:37:23.230

Edit: your objects are already unicode. It seems to me there is no reason to use encode/decode at all.

>>> print u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.replace(u'Â','')
Oscar Winners Best Pictures Box Set £6.49

However, it seems to me that something is wrong there. The unicode objects actually hold undecoded UTF-8 bytes; decoding the same bytes from a plain string gives the right result:

>>> print 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.decode('utf8')
Oscar Winners Best Pictures Box Set £6.49

The repr() you posted should not be a unicode object. That's why I was asking where you are getting the data from; something is wrong there.
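
If the data is fetched by your own code, decoding the raw response bytes with the charset from the Content-Type header avoids the problem at the source. A rough Python 3 sketch; decode_body and its arguments are hypothetical, and it assumes you have the undecoded body and the header value at hand:

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # Returns the default when the header carries no charset parameter.
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Simulated response: UTF-8 bytes plus the header shown in the question.
raw_body = 'Oscar Winners Best Pictures Box Set \u00a36.49'.encode('utf-8')
decoded_body = decode_body(raw_body, 'text/html; charset=utf-8')
print(decoded_body)
```

Decoding once, with the declared charset, means no later replace or re-encode step is needed for the UTF-8 case; the &#xA3; style references still need separate handling.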

edited Jan 16 '11 at 14:37

answered Jan 16 '11 at 14:31

UnicodeEncodeError: 'ascii' codec can't encode characters in position 100-101: ordinal not in range(128) – Marcos Placona Jan 16 '11 at 14:33
You should post repr() of the string that this fails for and the particular line (i.e. does it fail on the decode?). – ondra Jan 16 '11 at 14:36
