
System: Windows 10, Python 2.7.15, Scrapy 1.5.1

Goal: Retrieve the text from within the HTML markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.

Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info

Initial Progress: Python and Scrapy successfully installed. The following code...

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # each archive entry on the page is wrapped in this div
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }

...successfully produces the following results (when -o to .csv)...

href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018 
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018 
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018 
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018 
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018 
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018 

However, the spider will not touch any of the info buried behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. I am under the impression that to get the spider to actually see the Ajax-buried info, I need to simulate some sort of request. Variously, the correct type of request might be something to do with XHR, a Scrapy FormRequest, or something else. I am simply too new to web architecture in general to be able to surmise the answer.

I hacked together a version of the initial code that uses a FormRequest, which still reaches the initial page just fine, yet incrementing the only parameter that appears to change (when inspecting the XHR calls sent out when physically clicking the button on the page) does not appear to have any effect. That code is here...

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={
                    'l': 'en',
                    'f': '9041',
                    'search-result-theme': '',
                    'limit': '6',
                    'fromDate': '',
                    'toDate': '',
                    'event_format': '0',
                    'sort': 'DESC',
                    'word': '',
                    'offset': str(i * 6),
                },
                callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }

...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.

Can anyone help point me to what I am missing? Thank you in advance.

Postscript: I always seem to get heckled out of my chair whenever I seek help for coding problems. If I am doing something wrong, please have mercy on me, I will do whatever I can to correct it.

1 Answer


Scrapy doesn't render dynamic content on its own; you need something else to deal with the JavaScript.

This blog post about scrapy + splash has a good introduction on the topic.
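For example, here is a minimal sketch of what a Splash-backed version of your spider could look like. It assumes scrapy-splash is installed (pip install scrapy-splash) and a Splash instance is already running locally on port 8050 (e.g. via Docker); the selectors are just your originals reused.

import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
        # scrapy-splash wiring, as documented in the scrapy-splash README
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
        # render the page in Splash and give the JavaScript a moment to run
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # same extraction logic as before, now applied to the rendered HTML
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }

Note that rendering the page alone will still only give you the initial six items; actually clicking '+ SEE MORE ARCHIVES' would additionally require a small Lua script sent through Splash's execute endpoint (or a browser-driving tool such as Selenium).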

– Lucas Wieloch
  • Thank you for your response! – Justin Simpson Aug 02 '18 at 17:44
  • At the risk of asking a very ignorant question: using only the information on the Chrome inspect page window, how would I know that this is a javascript issue? – Justin Simpson Aug 02 '18 at 18:03
  • Well, in general you'll know it's JavaScript when content is displayed in a dynamic way. That being said, you could ask "But how do I know that some content was dynamically generated/displayed?" In the case of that page, when you inspect that button and then click it, you'll see other HTML elements showing up from 'nowhere'; they weren't rendered with the page, but now that the button has been clicked, they are. – Lucas Wieloch Aug 02 '18 at 18:11