System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: Retrieve text from within the HTML markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.
Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Initial Progress: Python and Scrapy successfully installed. The following code...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # each archive entry on the initially served page sits in one of these divs
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when exported with -o to .csv)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
However, the spider will not touch any of the info buried behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. My impression is that to get the spider to actually see the Ajax-loaded info, I need to simulate some sort of request: variously, the correct type of request might involve XHR, a Scrapy FormRequest, or something else. I am simply too new to web architecture in general to surmise the answer.
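To make concrete what I mean by "simulate some sort of request", here is the rough shape of what I imagine is needed. To be clear, 'some-ajax-endpoint' and its query parameters below are placeholders I invented for illustration; identifying the real XHR endpoint is exactly the part I cannot figure out...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # placeholder URL -- the real one would presumably come from the
        # browser dev tools' Network tab while clicking '+ SEE MORE ARCHIVES'
        for offset in range(0, 60, 6):
            yield scrapy.Request(
                url='https://magic.wizards.com/en/some-ajax-endpoint?limit=6&offset=%d' % offset,
                callback=self.parse,
            )

    def parse(self, response):
        # assuming the endpoint returns an HTML fragment, the selectors
        # from my working spider above should still apply
        for event in response.css('div.article-item-extended'):
            yield {'href': event.css('a::attr(href)').extract()}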
I hacked together a version of the initial code that uses a FormRequest instead. It still reaches the initial page just fine, but incrementing the only parameter that appears to change (judging by the XHR calls sent out when physically clicking the button on the page) has no effect. That code is here...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        # 'offset' is the only field that changes in the XHR calls I observed
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={
                    'l': 'en', 'f': '9041', 'search-result-theme': '',
                    'limit': '6', 'fromDate': '', 'toDate': '',
                    'event_format': '0', 'sort': 'DESC', 'word': '',
                    'offset': str(i * 6),
                },
                callback=self.parse,
            )

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.
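In case it helps anyone reproduce this quickly, the same behavior shows up in scrapy shell: posting with a non-zero offset still returns a page whose first item is the newest event from the un-clicked page...
$ scrapy shell
>>> from scrapy import FormRequest
>>> url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
>>> formdata = {'l': 'en', 'f': '9041', 'search-result-theme': '', 'limit': '6',
...             'fromDate': '', 'toDate': '', 'event_format': '0', 'sort': 'DESC',
...             'word': '', 'offset': '6'}
>>> fetch(FormRequest(url, formdata=formdata))
>>> len(response.css('div.article-item-extended'))
6
>>> response.css('div.article-item-extended a::attr(href)').extract_first()
u'/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02'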
Can anyone help point me to what I am missing? Thank you in advance.
Postscript: I always seem to get heckled out of my chair whenever I seek help for coding problems. If I am doing something wrong, please have mercy on me; I will do whatever I can to correct it.