
I am trying to scrape the names of all the items on a webpage, but only 18 are visible by default and my code scrapes only those. Clicking the "Show all" button reveals the rest, but that button is driven by JavaScript.

After some research I found that the PyQt module can be used to drive JavaScript buttons like this one. I tried it, but I am still not able to trigger the click event. Here is my code:

import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://www.att.com/shop/wireless/devices/smartphones.html'  
r = Render(url)
jsClick = var evObj = document.createEvent('MouseEvents');
             evObj.initEvent('click', true, true );
             this.dispatchEvent(evObj);


allSelector = "a#deviceShowAllLink" # This is the css selector you actually need
allButton   = r.frame.documentElement().findFirst(allSelector)
allButton.evaluateJavaScript(jsClick)




page = allButton
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_26decv1.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:            
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%A") ,unicode(item.string).encode('utf8').strip(),textcontent])

This is the error I get:

"Invalid Syntax" error for evObj

Can someone please help me trigger this "onclick" event so that I can scrape the data for all items? Pardon my ignorance, as I am new to programming.

  • 1
    You'd be better off interpreting what the JS *does*. It most likely just loads data via AJAX, probably as HTML or JSON. You can see what your browser does with the developer tools. All major browsers come with such tools; use their network tab to see what extra requests are made. – Martijn Pieters Dec 20 '12 at 20:42
  • @MartijnPieters After clicking "Show all devices", the network tab shows extra entries, fetched with the GET method, for the devices that were hidden earlier. –  Dec 26 '12 at 07:48
  • @MartijnPieters, can you be more specific about how to scrape data from web pages that use JavaScript? Any reference would do. – DJJ Nov 07 '14 at 18:05
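
As a rough illustration of the approach suggested in these comments: once the network tab shows which GET request returns the hidden devices, that URL can be fetched directly and parsed without running any JavaScript. The sketch below is written for Python 3 with only the standard library, unlike the Python 2 code in the question. The class name `clickStreamSingleItem` is taken from the question; the helper names, the sample markup, and the assumption that the response is UTF-8 HTML are hypothetical. The real request URL is not given in this thread, so none is filled in here.

```python
# Python 3 sketch, standard library only.
from html.parser import HTMLParser
from urllib.request import urlopen


class DeviceNameParser(HTMLParser):
    """Collect the text of <a> tags whose class list has clickStreamSingleItem."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # text chunks gathered while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "clickStreamSingleItem" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse runs of whitespace inside the link text.
            name = " ".join("".join(self._buffer).split())
            if name:
                self.names.append(name)
            self._buffer = None


def extract_device_names(html):
    """Return the device names found in an HTML string."""
    parser = DeviceNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


def fetch(url):
    """Download a URL and decode it as UTF-8 (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical fragment standing in for the response seen in the network tab.
sample = '''
<div class="list-item"><a class="clickStreamSingleItem" href="#">Phone A</a></div>
<div class="list-item"><a class="other" href="#">Ignore me</a></div>
<div class="list-item"><a class="title clickStreamSingleItem" href="#"> Phone
   B </a></div>
'''
print(extract_device_names(sample))
```

With the real URL copied from the network tab, `extract_device_names(fetch(url))` would take the place of the sample string. If the endpoint turns out to return JSON rather than HTML, the `json` module would be the better fit.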

2 Answers

from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait
from BeautifulSoup import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# use firefox to get page with javascript generated content
with closing(Firefox()) as driver:
    driver.get("http://www.att.com/shop/wireless/devices/smartphones.html")
    # locate the "Show all" link by its id and click it
    button = driver.find_element(By.ID, 'deviceShowAllLink')
    button.click()
    # wait (up to 10 seconds) until the link disappears, i.e. all items are loaded
    WebDriverWait(driver, 10).until(
        EC.invisibility_of_element_located((By.ID, "deviceShowAllLink"))
    )
    # store it to string variable
    page_source = driver.page_source

soup = BeautifulSoup(page_source)
items = soup.findAll('div', {"class": "list-item"})
print "items count:",len(items)

Will this help?

Joseph Thomas

To click the button, call evaluateJavaScript on the element itself, passing the JavaScript as a Python string:

jsClick = """var evObj = document.createEvent('MouseEvents');
             evObj.initEvent('click', true, true );
             this.dispatchEvent(evObj);
             """

allSelector = "a#deviceShowAllLink" # This is the css selector you actually need
allButton   = r.frame.documentElement().findFirst(allSelector)
allButton.evaluateJavaScript(jsClick)
  • Do I need to import something to run this code? After running it I get a "selectorAll" is not defined error –  Dec 21 '12 at 08:57
  • 1
    @user1915050 fixed an undefined variable in the code, sorry about that, see if it works now –  Dec 21 '12 at 15:28
  • Merry Christmas. I made changes according to your code and came across the error mentioned above: an "Invalid Syntax" error for evObj. Please go through the updated code in my post above –  Dec 26 '12 at 06:21
  • Also, why have you put the declaration of jsClick in comments? –  Dec 26 '12 at 09:49
  • 1
    @user1915050 That's because what you are running there is a piece of JavaScript code. I triple-quoted it so the string can span multiple lines without ending each line with \n\. I could have put it between single quotes on one line, but I chose this form because it is more readable. See what happens if you run that piece of code again exactly as it appears in my example –  Dec 26 '12 at 10:45
  • I used `html = allButton.webFrame().toHtml()` to feed the HTML to Beautiful Soup. I get no error, but the output contains only the 18 items that are visible by default on this webpage. I am using PyQt and evaluateJavaScript to trigger the onclick event that shows all devices, so that I can extract data for all of them. Please go through the above code and help me solve this. –  Dec 26 '12 at 18:56
  • Well, that seems to be a website-related problem. I would first check with `QWebView` whether I can click on that item manually –  Dec 26 '12 at 19:18
  • Hmm, thanks for your suggestion. I have created a new question for this problem at http://stackoverflow.com/questions/14050585/issue-in-invoking-onclick-event-using-pyqt-javascript Please go through it and help in any way you can. –  Dec 27 '12 at 06:50
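
The "Invalid Syntax" error discussed in this thread can be reproduced without PyQt. The minimal sketch below hands both variants of the `jsClick` assignment to Python's built-in `compile()`: the JavaScript pasted directly after the `=` sign, as in the question, and the same JavaScript wrapped in a triple-quoted string, as in the answer. The helper name `is_valid_python` is made up for this illustration.

```python
# The JavaScript pasted straight into a Python file, as in the question:
# Python tries to parse "var evObj = ..." as Python and fails.
unquoted = "jsClick = var evObj = document.createEvent('MouseEvents');\n"

# The same JavaScript wrapped in a triple-quoted string literal, as in the
# answer: to Python it is just text assigned to a variable.
quoted = (
    'jsClick = """var evObj = document.createEvent(\'MouseEvents\');\n'
    "             evObj.initEvent('click', true, true);\n"
    "             this.dispatchEvent(evObj);\n"
    '             """\n'
)


def is_valid_python(source):
    """Return True if the source text compiles as Python, False otherwise."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(is_valid_python(unquoted))  # False
print(is_valid_python(quoted))    # True
```

The triple quotes are therefore not a comment. They delimit an ordinary string whose contents are only interpreted as JavaScript later, when the string is passed to `evaluateJavaScript`.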