I am trying to return HTML as a string from an e-shop website, but I get back some weird characters. When I look at the web console I do not see these characters in the HTML. I also do not see them when the HTML is displayed in a pandas DataFrame in a Jupyter notebook. The link is https://www.powerhousefilms.co.uk/collections/limited-editions/products/immaculate-conception-le. I am using the same method for other products on this website, but I only see these characters on this one page. The other pages on the site do not have this problem.
import requests
from bs4 import BeautifulSoup

# url is the product link given above
html = requests.get(url).text
soup = BeautifulSoup(html)
elem = soup.find_all('div', {'class': 'product-single__description rte'})
s = str(elem[0])
s then looks like:
<div class="product-single__description rte">
<div class="product_description">
<div>
<div>
<div><span style="color: #000000;"><em>THIS ITEM IS AVAILABLE TO PRE-ORDER. PLEASE NOTE THAT YOUR PAYMENT WILL BE TAKEN IMMEDIATELY, AND THAT THE ITEM WILL BE DISPATCHED JUST BEFORE THE LISTED RELEASE DATE. </em></span></div>
<div><span style="color: #000000;"><em>Â </em></span></div>
<div><span style="color: #000000;"><em>SHOULD YOU ORDER ANY OF THEÂ ALREADY RELEASED ITEMS FROM OURÂ CATALOGUE AT THE SAME TIME AS THIS PRE-ORDER ITEM, PLEASE NOTE THATÂ YOUR PURCHASES WILL ALL BE SHIPPED TOGETHER WHENÂ THIS PRE-ORDERÂ ITEM BECOMES AVAILABLE.</em></span></div>
</div>
<div><span style="color: #38761d;">Â </span></div>
<div>
<strong>(Jamil Dehlavi, 1992)</strong><br/><em>Release date: 25 March 2019</em><br/>Limited Blu-ray Edition (World Blu-ray premiere)<br/><br/>A Western couple (played by Melissa Leo and James Wilby) working in Pakistan visit an unconventional holy shrine to harness its spiritual powers to help them conceive a child. They are lavished with the attentions of the shrine’s leader (an exceptional performance from Zia Mohyeddin – <em>Lawrence of Arabia</em>, <em>Khartoum</em>) and her followers, but their methods and motives are not all that they seem, and the couple’s lives are plunged into darkness.<br/><br/>This ravishing, unsettling film from director Jamil Dehlavi (<em>The Blood of Hussain</em>, <em>Born of Fire</em>) is a deeply personal work which raises questions of cultural and sexual identity, religious fanaticism and the abuses of power. The brand-new 2K restoration from the original negative was supervised and approved by Dehlavi and cinematographer Nic Knowland.<br/><br/><strong>INDICATOR LIMITED EDITION BLU-RAY SPECIAL FEATURES:</strong>
</div>
<div>
<ul>
<li>New 2K restoration by Powerhouse Films from the original negative, supervised and approved by director Jamil Dehlavi and cinematographer Nic Knowland</li>
<li>
<div>Original stereo audio</div>
</li>
<li>
<div>Alternative original mono mix</div>
I have tried specifying the encoding but still get the weird characters. Of the 50+ products on this website, only a few have this problem.
Is there a problem with how I am scraping, or is there possibly an easy way to clean this up?
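For what it's worth, here is a minimal sketch of my current guess and a possible stopgap. It assumes (not confirmed for this page) that each "Â" comes from a non-breaking space (U+00A0) whose UTF-8 bytes were decoded as Latin-1 somewhere along the way. The names demo_mojibake and clean_fragment are hypothetical helpers I made up for illustration; they are not part of requests or BeautifulSoup.

```python
NBSP = "\u00a0"  # non-breaking space


def demo_mojibake() -> str:
    """Show how a stray 'Â' can appear (assumed cause, not verified for this site).

    UTF-8 encodes U+00A0 as the two bytes C2 A0. Decoding those bytes as
    Latin-1 yields two characters: 'Â' (U+00C2) followed by U+00A0.
    """
    return NBSP.encode("utf-8").decode("latin-1")


def clean_fragment(text: str) -> str:
    """Hypothetical cleanup: collapse 'Â' + NBSP, and any lone NBSP, to a plain space.

    Only the exact pair U+00C2 U+00A0 is removed, so a legitimate 'Â'
    elsewhere in the text is left alone.
    """
    return text.replace("\u00c2\u00a0", " ").replace(NBSP, " ")


if __name__ == "__main__":
    broken = "THE" + demo_mojibake() + "ALREADY RELEASED ITEMS"
    print(repr(broken))
    print(clean_fragment(broken))
```

If that guess is right, calling clean_fragment(s) on the string above would be a workaround, but I would rather understand why only a few pages are affected.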
Thanks