
I am trying to scrape data off a website using Scrapy, a Python framework. I can get the data from the website using spiders, but the problem occurs when I try to navigate through the website.

According to this post, Scrapy does not handle JavaScript well.

Also, as stated in the accepted answer, I cannot use mechanize or lxml. It suggests using a combination of Selenium and Scrapy.

Function of the button:

I am browsing through offers on a website. The function of the button is to show more offers. So on clicking it, it calls a JavaScript function which loads the results.

I have also been looking at CasperJS and PhantomJS. Will they work?

I just need to automate the clicking of a button. How do I go about this?

  • Really depends on the button. Can you share the details? – alecxe Jan 07 '15 at 05:57
  • If you use Selenium, the JavaScript will execute in an otherwise normal browser. You can certainly automate simple button clicks with only Selenium IDE or WebDriver. – BadZen Jan 07 '15 at 05:59

1 Answer


First of all, yes, you can use the PhantomJS ghostdriver with Python. It is built into python-selenium:

pip install selenium

Demo:

>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> driver.get('https://stackoverflow.com/questions/27813251')
>>> driver.title
u'javascript - Web scraping: Automating button click - Stack Overflow'
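
To answer the original question directly: once the page is loaded, clicking the button is a one-liner. Here is a minimal continuation of the demo above, assuming a hypothetical CSS selector for the "show more offers" button:

>>> # hypothetical selector - replace it with the real button's locator
>>> button = driver.find_element_by_css_selector('a.show-more-offers')
>>> button.click()  # runs the JavaScript handler that loads more offers
>>> html = driver.page_source  # the DOM after the new offers appear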

There are also several other threads that provide examples of "scrapy+selenium" spiders.

There is also a scrapy-webdriver module that can probably help with this too.


Note, though, that using Scrapy with Selenium adds huge overhead and slows things down dramatically, even with a headless PhantomJS browser.

There is a good chance you can mimic that "show more offers" button click by simulating the underlying request that fetches the data you need. Use the browser developer tools to see what kind of request is fired when the button is clicked, and use scrapy.http.Request to simulate it inside the spider, as sketched below.
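
For illustration, here is a minimal sketch of that approach. The endpoint, parameter, and response format are hypothetical; the real ones have to be copied from the Network tab of the developer tools:

import json
import scrapy

class OffersSpider(scrapy.Spider):
    name = 'offers'
    start_urls = ['http://example.com/offers']  # hypothetical offers page

    def parse(self, response):
        # ... extract the first batch of offers from response ...
        # simulate the request the "show more offers" button fires
        # (endpoint and parameter are made up for the example)
        yield scrapy.http.Request('http://example.com/offers/more?page=2',
                                  callback=self.parse_more)

    def parse_more(self, response):
        # many such endpoints return JSON rather than HTML
        data = json.loads(response.body)
        # ... extract the additional offers from data ...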

  • Thank you for this answer. I have started to work on something using your answer as the base. I wanted more clarity on some details. In the example, you have illustrated that I can get the page title using `PhantomJS`. So what's the use of Scrapy here? I basically want to know the difference between their usage. Could you explain? – praxmon Jan 13 '15 at 05:08
  • @PrakharMohanSrivastava The key thing is that Scrapy is not a browser and doesn't have a JavaScript engine built in. A lot of sites use JavaScript to construct their pages; this code is executed in the browser, which follows the `script` links, loads additional js files, executes the code, and changes the DOM. For these sites, it is easier to use a real browser to construct the page, as you would see it in the browser developer tools. Then you can feed the resulting `.page_source` to Scrapy for processing. Hope that makes things a bit more clear. – alecxe Jan 13 '15 at 05:13
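
For illustration, that hand-off can be as small as this, feeding the browser-rendered HTML into a Scrapy Selector (the XPath is a made-up example):

>>> from scrapy.selector import Selector
>>> sel = Selector(text=driver.page_source)
>>> sel.xpath('//div[@class="offer"]//text()').extract()  # hypothetical XPath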