I am trying to return HTML as a string from an e-shop website but get back some weird characters. When I look at the web console I do not see these characters in the HTML. I also do not see them when the HTML is displayed in a pandas DataFrame in a Jupyter notebook. The link is https://www.powerhousefilms.co.uk/collections/limited-editions/products/immaculate-conception-le. I am using the same method for other products on this website but only see these characters on this one page. The other pages on the site do not have this problem.

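import requests
from bs4 import BeautifulSoup

# url is the product page linked above
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
elem = soup.find_all('div', {'class': 'product-single__description rte'})
s = str(elem[0])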

s then looks like:

    <div class="product-single__description rte">
<div class="product_description">
<div>
<div>
<div><span style="color: #000000;"><em>THIS ITEM IS AVAILABLE TO PRE-ORDER. PLEASE NOTE THAT YOUR PAYMENT WILL BE TAKEN IMMEDIATELY, AND THAT THE ITEM WILL BE DISPATCHED JUST BEFORE THE LISTED RELEASE DATE. </em></span></div>
<div><span style="color: #000000;"><em>Â </em></span></div>
<div><span style="color: #000000;"><em>SHOULD YOU ORDER ANY OF THEÂ ALREADY RELEASED ITEMS FROM OURÂ CATALOGUE AT THE SAME TIME AS THIS PRE-ORDER ITEM, PLEASE NOTE THATÂ YOUR PURCHASES WILL ALL BE SHIPPED TOGETHER WHENÂ THIS PRE-ORDERÂ ITEM BECOMES AVAILABLE.</em></span></div>
</div>
<div><span style="color: #38761d;">Â </span></div>
<div>
<strong>(Jamil Dehlavi, 1992)</strong><br/><em>Release date: 25 March 2019</em><br/>Limited Blu-ray Edition (World Blu-ray premiere)<br/><br/>A Western couple (played by Melissa Leo and James Wilby) working in Pakistan visit an unconventional holy shrine to harness its spiritual powers to help them conceive a child. They are lavished with the attentions of the shrine’s leader (an exceptional performance from Zia Mohyeddin – <em>Lawrence of Arabia</em>, <em>Khartoum</em>) and her followers, but their methods and motives are not all that they seem, and the couple’s lives are plunged into darkness.<br/><br/>This ravishing, unsettling film from director Jamil Dehlavi (<em>The Blood of Hussain</em>, <em>Born of Fire</em>) is a deeply personal work which raises questions of cultural and sexual identity, religious fanaticism and the abuses of power. The brand-new 2K restoration from the original negative was supervised and approved by Dehlavi and cinematographer Nic Knowland.<br/><br/><strong>INDICATOR LIMITED EDITION BLU-RAY SPECIAL FEATURES:</strong>
</div>
<div>
<ul>
<li>New 2K restoration by Powerhouse Films from the original negative, supervised and approved by director Jamil Dehlavi and cinematographer Nic Knowland</li>
<li>
<div>Original stereo audio</div>
</li>
<li>
<div>Alternative original mono mix</div>

I have tried specifying the encoding but still get the weird characters. Of the 50+ products on this website, only a few have this problem.

Is there a problem with how I am scraping, or is there an easy way to clean this up?

Thanks

Stig
  • You have to clean the HTML text to make it normal text. It is possible, give me 3 minutes. – Maheshwar Kuchana Feb 11 '19 at 11:59
  • Thanks, Maheshwar Kuchana. I suspect cleaning the HTML is a workaround, but I was hoping to be able to scrape it correctly in the first place. Perhaps I am just missing some encoding parameter. – Stig Feb 11 '19 at 12:03

2 Answers

0

Use this piece of code to download the visible content of a web page. Just set page_url to the URL of the page.

from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.request import urlopen


page_url = "URL Here"


def tag_visible(element):
    # Skip text inside non-rendered tags and HTML comments
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


def Extract_Text(html_bytes, url):
    # Write the URL followed by the visible text to DOC.txt as UTF-8
    text_data = text_from_html(html_bytes)
    with open("DOC.txt", "w", encoding="utf-8") as f:
        f.write(str(url) + "\n" + text_data)


response = urlopen(page_url)
# Only parse the response if the server reports it as HTML
if 'text/html' in response.getheader('Content-Type', ''):
    html_bytes = response.read()
    Extract_Text(html_bytes, page_url)
0

So it turns out Excel was the cause of this. When I saved to CSV and opened the file in Excel, I got the weird results.
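One way to confirm that the scraped string itself is fine is to list its non-ASCII characters by code point. This is only a minimal sketch: the sample string is an assumed stand-in for the scraped description, and `describe_non_ascii` is a hypothetical helper name, not part of any library.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Assumed sample: a non-breaking space (U+00A0) between two words,
# standing in for the real scraped description.
sample = "THE\u00a0ALREADY RELEASED"
report = describe_non_ascii(sample)
print(report)
```

If the report lists only U+00A0 and no U+00C2 (Â), the string held in Python is correct, and the Â is being introduced later by whatever program reads the saved file.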

To prevent this I used df.to_csv('df.csv', index=False, encoding='utf-8-sig'). Specifying the encoding got rid of the strange characters.
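As far as I understand it, the `utf-8-sig` codec differs from plain `utf-8` only in that it writes a three-byte byte order mark (BOM) at the start of the output, which Excel generally takes as the signal to read the file as UTF-8. A small sketch using only the standard library, with an assumed sample value rather than the real DataFrame:

```python
import codecs

# Assumed sample value containing a non-breaking space (U+00A0).
sample = "THE\u00a0ALREADY"

plain = sample.encode("utf-8")
with_bom = sample.encode("utf-8-sig")

# utf-8-sig prepends the BOM bytes EF BB BF; the remaining bytes are identical.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain

# Decoding with utf-8-sig strips the BOM again.
round_trip = with_bom.decode("utf-8-sig")

print(has_bom, same_payload, round_trip == sample)
```

The text itself is unchanged; only the marker at the start of the file differs, so other tools that read the CSV should open it with `utf-8-sig` as well to avoid a stray BOM character in the first column name.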

Python Writing Weird Unicode to CSV has some info about the encoding and how Excel interprets CSV files.
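The stray Â is what a non-breaking space looks like when its UTF-8 bytes are read with a single-byte Windows code page. The sketch below assumes the reader falls back to cp1252; the actual fallback depends on the system locale, and the sample string is made up for illustration.

```python
# Assumed sample value containing a non-breaking space (U+00A0).
sample = "THE\u00a0ALREADY"

# In UTF-8 the non-breaking space is stored as the two bytes C2 A0.
raw = sample.encode("utf-8")

# A cp1252 reader treats each byte as its own character:
# C2 becomes "Â" and A0 becomes a non-breaking space.
garbled = raw.decode("cp1252")
print(repr(garbled))

# Reversing the wrong decode recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == sample)
```

This also suggests why only some product pages were affected: presumably only their descriptions contain non-breaking spaces.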

Wai Ha Lee
Stig