How to remove '\xa0' in html source?

Question

I am trying to scrape the price information from an Amazon page using Beautiful Soup.

The code was written on macOS Catalina (version 10.15.5), the web browser was Google Chrome 84.0.4147.135 (Official Build) (64-bit), and the Python version is 3.8.2.

The output (price) is shown on the last line of the code below.

Is there a way to remove the unwanted characters from the output or improve my code so the final output (price) reflects just ₹1,700.00?

The unwanted character is "\xa0".

Also, is there an explanation of what this character means and why it appears in the output? Thanks.

Please refer to the code below:

import bs4

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}

res = requests.get('https://www.amazon.in/Automate-Boring-Python-Albert-Sweigart/dp/1593275994', headers=headers)

res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'html.parser')

soup.select('#soldByThirdParty > span')

[₹ 1,700.00]

elems = soup.select('#soldByThirdParty > span')

elems[0].text

'₹\xa01,700.00'

You can refer to [this](https://stackoverflow.com/questions/1449059/why-is-this-a0-character-appearing-in-my-htmlelement-output) for `\xa0` and a simple string split and concat can get your desired result. — lincr, Aug 31 '20 at 12:32
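The split-and-join idea mentioned in this comment could look roughly like the sketch below. It assumes the scraped text matches the sample output in the question; the variable names `raw` and `price` are only illustrative.

```python
# Sample value copied from the question's output; a live page may differ.
raw = '₹\xa01,700.00'

# Split on the non-breaking space, then join the pieces back together.
parts = raw.split('\xa0')
price = ''.join(parts)

print(price)
```

Calling `raw.split()` with no argument should also work here, since Python treats `\xa0` as whitespace, but that variant would strip ordinary spaces too.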

score 1 · Answer 1 · answered Aug 31 '20 at 12:35


To remove the unwanted characters, you can use the standard str.replace() method like this:

price = elems[0].text.replace(u'\xa0', u'')

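\xa0 is the non-breaking space character (U+00A0, written as &nbsp; in HTML), which is why it shows up between the currency symbol and the amount. If you want further information about the \xa0 character, I can suggest this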
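As a small self-contained sketch of the options, assuming the scraped text is exactly the sample string from the question (the names `raw`, `price`, `spaced` and `amount` are illustrative, not part of the original code):

```python
import unicodedata
from decimal import Decimal

# Sample value copied from the question's output; a live page may differ.
raw = '₹\xa01,700.00'

# Option 1: drop the non-breaking space entirely.
price = raw.replace('\xa0', '')

# Option 2: keep a separator, but as an ordinary space.
# NFKC normalization maps U+00A0 to a regular space (U+0020).
spaced = unicodedata.normalize('NFKC', raw)

# Optional: turn the cleaned string into a number for arithmetic.
amount = Decimal(price.lstrip('₹').replace(',', ''))

print(price)
print(spaced)
print(amount)
```

Option 1 gives the format asked for in the question. Option 2 may be preferable if other text from the page needs the same cleanup and the spacing should be kept.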


Giovanni

