
Disclaimer: I've seen numerous similar posts on Stack Overflow and tried to do it the same way, but they don't seem to work on this website.

I'm using Python Scrapy to get data from koovs.com.

However, I'm not able to get the product size, which is dynamically generated. Specifically, if someone could guide me a little on getting the 'Not available' size tag from the drop-down menu on this link, I'd be grateful.

I am able to get the size list statically, but that only gives me the list of sizes, not which of them are available.
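
For reference, this is roughly the kind of static extraction I'm doing at the moment (a minimal sketch; the selector is an assumption about the page markup). It returns the size labels but says nothing about their availability:

import scrapy


class SizeSpider(scrapy.Spider):
    name = "sizes"
    start_urls = [
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    ]

    def parse(self, response):
        # the static HTML only carries the size labels; the availability
        # information is filled in later by JavaScript
        for size in response.css("div.select-size select.sizeOptions option::text").extract():
            print(size)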

– Pravesh Jain
  • Correct me if I'm wrong, you are able to get the list of sizes, but having difficulties filtering only available sizes? – alecxe May 20 '15 at 09:36
  • Exactly! I am able to get them statically and doing that I only get the list of sizes and not which of them are available. I'll add this to the question. – Pravesh Jain May 20 '15 at 09:58
  • Would you be okay involving selenium? – alecxe May 21 '15 at 09:04
  • I've never really used selenium but if it's required only to get some data and not required during the actual scraping then it's good. Could you guide me a little on how it would be used? – Pravesh Jain May 21 '15 at 10:03
  • [This question](https://stackoverflow.com/questions/31174330/passing-selenium-response-url-to-scrapy) and [this one](https://stackoverflow.com/questions/19327406/how-to-set-different-scrapy-settings-for-different-spiders) helped me a lot – Henadzi Rabkin Dec 22 '19 at 22:31

4 Answers


You can also solve it with ScrapyJS (no need for selenium and a real browser):

This library provides Scrapy+JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

$ docker run -p 8050:8050 scrapinghub/splash

Put the following settings into settings.py:

SPLASH_URL = 'http://192.168.59.103:8050' 

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that is able to see the size availability information:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
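        # skip the first <option> (the dropdown placeholder), keeping only actual sizes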
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print option.xpath("text()").extract()

Here is what is printed on the console:

[u'S / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']
– alecxe
  • Both great answers. Both the approaches work. I wonder if there is an advantage using one of them over the other? – Pravesh Jain May 25 '15 at 10:18
  • @PraveshJain from what I understand, if you are okay with both the approaches, I would stick to splash - in theory, this should be faster since it doesn't involve a real browser at all. Besides, you can use this option in a non-real-screen headless environment. It is also easy to set up and there are almost no changes to the scrapy code - the key part is the middleware that scrapyjs provides. Hope that helps. – alecxe May 25 '15 at 13:20
  • $ docker run -p 8050:8050 scrapinghub/splash - this command..how can i automate this command along with scrapy to scrape data using a cron job scheduler.. it obviously is not a great idea to keep docker process running at all time..may be some sh script before i make call to reactor at scheduled time ? – MrPandav Jul 03 '15 at 09:40
  • @MrPandav okay, you are probably asking about the docker python client, see https://github.com/docker/docker-py (a rough sketch follows after these comments) – alecxe Jul 03 '15 at 13:05
  • @alecxe When I use the exact same solution, it times out again and again. I just change the start-url to something that is available right now in store. – user_3068807 Feb 26 '16 at 00:59
  • Might be a silly question, but where is the settings.py file? I looked in applications, the repo area Docker created, and a few other random places I thought it might be. No idea where to find it. I'm running OS X High Sierra. – Chelsea Jan 08 '18 at 00:06
  • @Chelsea the settings.py should be stored in your project directory. ProjectName > projectName > settings.py – Genfood Jan 25 '18 at 19:29
  • is docker available on windows 10 and is it free?! – oldboy Jun 13 '18 at 06:58
  • @alecxe it technically does use a browser: `It's a lightweight browser with an HTTP API` – Tjorriemorrie Aug 09 '18 at 04:24
  • Just a silly doubt - the host ip in SPLASH_URL is docker ip address right? I am trying docker ip address but it is giving me timeout – Plasmatiger May 25 '20 at 23:12
  • ```Retrying (failed 1 times): TCP connection timed out: 60: Operation timed out. 2020-05-26 04:41:18 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2020-05-26 04:41:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying (failed 2 times): TCP connection timed out: 60: Operation timed out.``` – Plasmatiger May 25 '20 at 23:13
  • Okay I switched to 0.0.0.0 and now response.text is having encoding problem. Should I encode it manually or am doing something wrong? – Plasmatiger May 25 '20 at 23:20
  • @Plasmatiger run `docker-machine ip default` in docker then change your SPLASH_URL to that, worked for me to resolve your issue. – Winters Jul 21 '20 at 02:16
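
Following up on the docker-py pointer in the comments above, here is a minimal sketch for starting the Splash container from a Python script (e.g. one kicked off by cron) using the Docker SDK for Python (`pip install docker`). The image name and port mapping mirror the `docker run` command from the answer; everything else is an assumption:

import docker

client = docker.from_env()

# equivalent of `docker run -p 8050:8050 scrapinghub/splash`, started detached
# so the script can continue and launch the Scrapy crawl
container = client.containers.run(
    'scrapinghub/splash',
    detach=True,
    ports={'8050/tcp': 8050},
)

# ... run the spider here, e.g. via scrapy.crawler.CrawlerProcess ...

container.stop()
container.remove()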

From what I understand, the size availability is determined dynamically by JavaScript executed in the browser. Scrapy is not a browser and cannot execute JavaScript.

If you are okay with switching to the Selenium browser automation tool, here is some sample code:

from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()  # can be webdriver.PhantomJS()
browser.get('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376')

# wait for the select element to become visible
select_element = WebDriverWait(browser, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.select-size select.sizeOptions")))

select = Select(select_element)
for option in select.options[1:]:
    print option.text

browser.quit()

It prints:

S / 34 -- Not Available
L / 40 -- Not Available
L / 42

Note that in place of Firefox you can use other webdrivers like Chrome or Safari. There is also an option to use a headless PhantomJS browser.
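
PhantomJS has since been deprecated, so for a headless run today a sketch along these lines should work (FirefoxOptions and the --headless flag assume a reasonably recent Selenium and geckodriver):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')  # render pages without opening a browser window

browser = webdriver.Firefox(options=options)
# ... same WebDriverWait / Select logic as above ...
browser.quit()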

You can also combine Scrapy with Selenium if needed.
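
One way to combine them, as a rough sketch (the browser is driven from inside the spider's parse callback; the spider name and the reuse of the size selector are assumptions):

import scrapy
from selenium import webdriver


class SeleniumAssistedSpider(scrapy.Spider):
    name = "selenium_assisted"
    start_urls = [
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    ]

    def __init__(self, *args, **kwargs):
        super(SeleniumAssistedSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()

    def parse(self, response):
        # let the real browser execute the JavaScript, then hand the rendered
        # HTML back to Scrapy's selectors
        self.browser.get(response.url)
        rendered = scrapy.Selector(text=self.browser.page_source)
        for option in rendered.css("div.select-size select.sizeOptions option::text").extract()[1:]:
            yield {'size': option.strip()}

    def closed(self, reason):
        self.browser.quit()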

– alecxe

I faced the same problem and solved it easily by following these steps:

pip install splash
pip install scrapy-splash
pip install scrapyjs

Download and install Docker Toolbox.

Open the Docker Quickstart Terminal and enter:

$ docker run -p 8050:8050 scrapinghub/splash

To set SPLASH_URL, check the default IP configured in the Docker machine by entering:

$ docker-machine ip default

(My IP was 192.168.99.100.) Then add the following to settings.py:

SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

That's it!
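
To verify that Splash is reachable at that address before pointing Scrapy at it, you can hit its render.html endpoint directly; a quick sketch using the requests library (the product URL is the one from the question):

import requests

SPLASH_URL = 'http://192.168.99.100:8050'  # the docker-machine IP from above

response = requests.get(SPLASH_URL + '/render.html', params={
    'url': 'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    'wait': 0.5,
})
print(response.status_code)            # 200 means Splash rendered the page
print('sizeOptions' in response.text)  # True if the size dropdown made it into the rendered HTML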

  • docker toolbox link is broken... :.( – oldboy Jun 13 '18 at 06:59
  • @Anthony - You can get docker from here: https://www.docker.com/get-docker The Docker Toolbox is for older Mac and Windows systems that do not meet the requirements of [Docker for Mac](https://docs.docker.com/docker-for-mac/) and [Docker for Windows](https://docs.docker.com/docker-for-windows/). – Tony Jun 22 '18 at 18:25

You have to interpret the JSON returned by the website; see the examples at scrapy.readthedocs and testingcan.github.io:

import scrapy
import json


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    page = 1
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"quote": quote["text"]}
        if data["has_next"]:
            self.page += 1
            url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
– Alexis Mejía