
I am trying to get product data from Metal Mulisha. I have a list of product IDs that I need to look up, so I use Python with the requests package and the search URL "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"

I then use BeautifulSoup to find the class and data I need, but I get an error saying that nothing is there.

So I opened the URL in Chrome and inspected the elements, and all the information I needed was in the HTML that Chrome showed.

Here is a snippet of what Chrome showed.

<div class="col-md-10 col-md-push-2">    
   <div data-rfkid="rfkid_7" data-keyphrase="20M35518334Z M45518403Z M45518415Z" class="rfk_sp rfk-sp">
       <div class="rfk_sp_container" data-nrp="2" data-ntp="2" data-pg="1" data-status="2" rfk_track_appear_once="f=sp,rfkid=rfkid_7,a=1,c=1">
       <div class="rfk_header">
              </div>
        <div class="rfk_message">
        <div class="rfk_msg_noresult">
           </div>
        <div class="rfk_msg_results">Top Results for "20m35518334z m45518403z m45518415z"</div>

The markup continues beneath the first div. The point is that Chrome shows a lot of information inside the <div data-rfkid="rfkid_7"> element.

When I run my Python script to find the first div, this is what I get.

<div class="col-md-10 col-md-push-2">
   <div data-keyphrase="20M35518334Z M45518403Z M45518415Z" data-rfkid="rfkid_7"></div>
    </div>

It is as if all the product information that I need is not there.

Here is my Python code so you can see what I did. I am using Python 3.5.

import requests
from bs4 import BeautifulSoup

url = "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
html = requests.get(url).text
bs = BeautifulSoup(html, 'lxml')

possible_links = bs.find('div', attrs={'class': 'col-md-10 col-md-push-2'})
print(possible_links)

My question is: why can't Python find the HTML I need? If I inspect the site in Chrome I see it just fine, but when I request the page with Python, it's not there. Does this have to do with JavaScript? And if so, how do I fix it?
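One way to check whether JavaScript is involved is to test the raw response for a class that only shows up in the rendered page. The sketch below assumes that `rfk_sp_container` (taken from the Chrome snippet above) is such a marker. The helper name `has_rendered_results` is hypothetical, and the two HTML strings are abbreviated copies of the snippets in this post rather than live responses.

```python
def has_rendered_results(html, marker="rfk_sp_container"):
    """Return True if the marker class name occurs anywhere in the HTML text."""
    return marker in html


# What requests returned: the placeholder div is empty.
static_html = '''<div class="col-md-10 col-md-push-2">
   <div data-keyphrase="20M35518334Z M45518403Z M45518415Z" data-rfkid="rfkid_7"></div>
</div>'''

# What Chrome showed after its scripts had run (abbreviated).
rendered_html = '''<div class="col-md-10 col-md-push-2">
   <div data-rfkid="rfkid_7" class="rfk_sp rfk-sp">
      <div class="rfk_sp_container" data-status="2"></div>
   </div>
</div>'''

print(has_rendered_results(static_html))    # False
print(has_rendered_results(rendered_html))  # True
```

Against the live page, the same check could be run on `requests.get(url).text`. If it returns False while Chrome shows the container, the results are most likely injected by a script after the initial HTML has loaded.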

Mullenb
  • Not related to the question: You should always use `requests.get(url).content` instead of `.text` if you pass it to a parser like BeautifulSoup. – Simon Kirsten Jun 20 '16 at 17:16
  • The inside of the div is probably content that is loaded by JavaScript calls after the initial HTML is rendered, which is why it doesn't show up in the content you get from requests. I'm not sure whether the Requests library can help you with this. You can try dryscrape: https://github.com/niklasb/dryscrape – Lucas Jun 20 '16 at 17:24
  • Here, some reference in SO: http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python – Lucas Jun 20 '16 at 17:26
  • Lucas, thanks for the input. I could not use dryscrape because it looks like it doesn't support Windows. After you said it was related to JavaScript, I looked for other solutions on Google and found [this](https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/). It uses `PyQt4.QtWebKit` to open the site and run its JavaScript. It's slow, but it worked. – Mullenb Jun 20 '16 at 20:19
  • What are you trying to get? – Padraic Cunningham Jun 20 '16 at 21:51
  • I have a list of over 800 product IDs, so I am searching for them in batches of 80 (the website cannot handle more than that). I am trying to scrape the links ('href') to the product pages so that I can then scrape each product page for price, description, image, and color. – Mullenb Jun 20 '16 at 21:54
  • Selenium may be your best bet, http://selenium-python.readthedocs.io/ – Padraic Cunningham Jun 20 '16 at 22:11
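The batching and Selenium suggestions in the comments above could be combined roughly as follows. This is only a sketch under stated assumptions: the helper names `chunked`, `build_search_url`, and `fetch_rendered_html` are hypothetical, Firefox and its WebDriver are assumed to be installed, and the fixed sleep is a crude stand-in for a proper wait. Only the pure helpers run at import time, so the browser is not launched unless `fetch_rendered_html` is called.

```python
from urllib.parse import quote


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_search_url(product_ids,
                     base="http://www.metalmulisha.com/shop/search/?q="):
    """Join the IDs with spaces and percent-encode them into the search URL."""
    return base + quote(" ".join(product_ids))


def fetch_rendered_html(url, wait_seconds=5):
    """Load `url` in a real browser so its JavaScript runs, then return the HTML."""
    import time
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for the scripts to fill in the results
        return driver.page_source
    finally:
        driver.quit()


# The IDs as they appear in the question's search URL; batches of 2 for illustration.
product_ids = ["20M35518334Z", "M45518403Z", "M45518415Z"]
for batch in chunked(product_ids, 2):
    print(build_search_url(batch))
```

With the real list, `chunked(product_ids, 80)` would give the batches of 80 mentioned above, and the string returned by `fetch_rendered_html(build_search_url(batch))` could be passed to `BeautifulSoup(html, 'lxml')` exactly as in the question.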
