I am trying to write a Python scraper using BeautifulSoup. I have successfully extracted most of the data, but I am now facing an encoding problem when extracting the price.

Here is my example:

The actual text is 1599€99

The scraped text is:

>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'

"\u20ac" is supposed to be the '€' symbol in UTF-8; however, encoding the text gives:

>>> prdt.find("span",{"class":"price"}).text.encode(encoding='UTF-8')
'1599\xe2\x82\xac99'

Does anyone know how to fix this issue?

Thanks.

user3351262

2 Answers

What you see is the representation (repr) of a unicode string, in which non-ASCII characters are shown as escape sequences. You can see the actual content by simply printing it.

>>> u1 = u'1599\u20ac99'
>>> print u1
# prints 1599€99

>>> u2 = u'1599€99'
>>> u2
# echoes u'1599\u20ac99'
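
The same distinction can be checked without relying on console output. The following is a minimal Python 3 sketch, not part of the original session: the variable names `price` and `escaped` are made up for illustration, and `ascii()` is used here as a stand-in for the escaped echo that the Python 2 console shows.

```python
# Python 3 sketch; `price` stands in for the scraped text from the question.
price = '1599\u20ac99'

# ascii() returns the escaped representation, similar to the Python 2 echo.
escaped = ascii(price)
print(escaped)

# The string itself contains one euro character, not a backslash sequence:
# four digits, one euro sign, two digits.
print(len(price))
print(price == '1599€99')
print('\\' in price)
```

The escape sequence only exists in the representation; the string holds a single EURO SIGN character at index 4.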
sardok
Your script works well:

>>> prdt.find("span",{"class":"price"}).text
u'1599\u20ac99'

The return value is a valid unicode string. The character u"\u20ac" is the EURO SIGN.

If you encode this character as UTF-8, you get a byte string (Python 3 displays it with a b prefix; Python 2 shows the same bytes without it):

>>> u'\u20ac'.encode('utf8')
b'\xe2\x82\xac'

This is the same code point encoded in UTF-8: E2 82 AC.
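
To illustrate that nothing is lost, here is a short round-trip sketch in Python 3. The names `text`, `encoded` and `decoded` are chosen for this example only; `text` is assumed to hold the value returned by the scraper in the question.

```python
# Python 3 sketch: encode the scraped text to UTF-8 bytes and decode it back.
text = '1599\u20ac99'

encoded = text.encode('utf-8')      # str -> bytes
decoded = encoded.decode('utf-8')   # bytes -> str

print(encoded)                # b'1599\xe2\x82\xac99'
print(decoded == text)        # True
print(hex(ord(text[4])))      # 0x20ac, the code point of the euro sign
print(len(text), len(encoded))  # 7 characters, 9 bytes
```

The euro sign is one character in the string but three bytes once encoded, which is why the encoded form looks longer and unfamiliar.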

See also this answer to "What is Unicode, UTF-8, UTF-16?"

Laurent LAPORTE