How to correct the encoding of the data scraped with beautifulsoup?

Question

I am trying to write a Python scraper using BeautifulSoup. I have successfully extracted most of the data, but I am now facing an encoding problem when extracting the price.

Here is my example:

The actual text is 1599€99

The scraped text is:

>>> prdt.find("span", {"class": "price"}).text
u'1599\u20ac99'

"\u20ac" is supposed to be the '€' symbol, but when I encode the text as UTF-8 I get this instead:

>>> prdt.find("span", {"class": "price"}).text.encode(encoding='UTF-8')
'1599\xe2\x82\xac99'

Does anyone know how to fix this issue?

Thanks.

score 1 · Answer 1 · answered Nov 20 '16 at 22:52

What you are seeing is the representation (repr) of a unicode string, not a corrupted value. You can see its actual content by simply printing it.

>>> u1 = u'1599\u20ac99'
>>> print u1
# prints 1599€99

>>> u2 = u'1599€99'
>>> u2
# the REPL echoes the repr: u'1599\u20ac99'
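The same distinction can be sketched in Python 3 (an assumption on my part; the session above is Python 2, where the REPL echoes the escaped form). The variable names are illustrative only. Here ascii() produces the escaped representation, while the string itself holds the real euro sign:

```python
# Python 3 sketch (the session above is Python 2); names are illustrative.
price = '1599\u20ac99'           # same text as the scraped value

escaped = ascii(price)           # escaped form, like the Python 2 REPL echo
print(escaped)                   # ASCII-only, safe on any terminal
print(len(price))                # number of characters, not bytes

# The escape sequence and the literal character denote the same string.
same = (price == '1599€99')
print(same)
```

Calling print(price) would display the euro sign itself on a terminal that can render it.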

answered Nov 20 '16 at 22:52

sardok

score 0 · Answer 2 · edited May 23 '17 at 10:32

Your script already works correctly:

>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'

The return value is a valid unicode string. The character u"\u20ac" is the EURO SIGN.
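If you want to confirm which character a given escape refers to, the standard unicodedata module can look it up. This is a small Python 3 sketch, and the variable names are my own:

```python
import unicodedata

ch = '\u20ac'                        # the character from the scraped price
name = unicodedata.name(ch)          # official Unicode character name
code_point = 'U+%04X' % ord(ch)      # code point in the usual U+XXXX notation

print(name, code_point)
```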

If you encode this character using the 'utf8' encoding, you get a byte string (Python 3 shows it with a b prefix; Python 2, as in your session, shows it without one):

>>> u'\u20ac'.encode('utf8')
b'\xe2\x82\xac'

These three bytes, E2 82 AC, are the same code point encoded in UTF-8, so the encoded output is correct as well.
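As a quick check, a round trip shows that nothing is lost: encoding yields those three bytes, and decoding them returns the original character. This is a Python 3 sketch with illustrative variable names:

```python
euro = '\u20ac'

encoded = euro.encode('utf-8')                       # text -> bytes
hex_bytes = ' '.join('%02X' % b for b in encoded)    # bytes shown as hex pairs
decoded = encoded.decode('utf-8')                    # bytes -> text again

print(hex_bytes)
print(decoded == euro)
```

The encoded form is what you would write to a file or send over the network; the decoded form is what you work with inside the program.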

See also this answer to "What is Unicode, UTF-8, UTF-16?".
