Scrapy returns same piece of information 80+ times

Question

I'm new to Scrapy and Python and have run into an issue.

I'm trying to get the entire list of PS3 games from Metacritic. Here is my code:

class MetacriticSpider(BaseSpider):
    name = "metacritic"
    allowed_domains = ["metacritic.com"]
    max_id = 10
    start_urls = [
        "http://www.metacritic.com/browse/games/title/ps3?page="
        #"http://www.metacritic.com/browse/games/title/xbox360?page=0"
    ]

    def start_requests(self):
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps3/{0}?page={1}'.format(c, i), callback = self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]/div')
        items = []

        for site in sites:
            #item = MetacriticItem()
            #titles = site.xpath('a/text()').extract()
            titles = site.xpath('//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
            #cscore = site.xpath('//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
            if titles:
                item = MetacriticItem()
                item['title'] = titles[0].strip()

                items.append(item)
        return items

For some reason, when I check the JSON file, I have 81 instances of each title, and the output starts with Assassin's Creed: Revelations - Ancestors Character Pack.

It should start on the first page, which lists the numbered titles, then move on to the A list and check each page in that, and so on. Any ideas on why it is behaving this way? I can't see what my problem is.

alecxe · Accepted Answer · 2014-03-25T17:51:44.987

2

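Your XPath should be relative (.//) to each site. An expression that starts with // searches the whole document no matter which selector it is called on, so every iteration matched the same first title. Use the relative form: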

titles = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()

Also, change the sites selection XPath to the following (note: no div at the end):

//div[@class="product_wrap"]
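
To see the difference between the two forms in isolation, here is a minimal sketch. It uses lxml directly rather than a Scrapy response (Scrapy selectors are built on lxml, so the same XPath semantics apply to site.xpath(...)). The HTML snippet, the game names, and the first_titles helper are made up for illustration; they are not taken from Metacritic.

```python
from lxml import html

# Hypothetical markup that mimics the structure described in the question.
PAGE = """
<html><body>
  <div class="product_wrap">
    <div class="basic_stat product_title"><a> Game A </a></div>
  </div>
  <div class="product_wrap">
    <div class="basic_stat product_title"><a> Game B </a></div>
  </div>
  <div class="product_wrap">
    <div class="basic_stat product_title"><a> Game C </a></div>
  </div>
</body></html>
"""


def first_titles(title_xpath):
    """Apply title_xpath to each product_wrap div and keep the first match."""
    tree = html.fromstring(PAGE)
    titles = []
    for site in tree.xpath('//div[@class="product_wrap"]'):
        found = site.xpath(title_xpath)
        if found:
            titles.append(found[0].strip())
    return titles


# Absolute: "//" restarts from the document root on every iteration.
absolute = first_titles('//div[contains(@class, "basic_stat product_title")]/a/text()')

# Relative: ".//" only looks inside the current product_wrap div.
relative = first_titles('.//div[contains(@class, "basic_stat product_title")]/a/text()')

print(absolute)
print(relative)
```

With the absolute expression, every iteration picks up the first title on the page, which is the kind of duplication described in the question. With the relative expression, each product_wrap div contributes its own title.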

edited Mar 25 '14 at 17:51

answered Mar 25 '14 at 16:55


"relevant" -> "relative"? – paul trmbrth Mar 25 '14 at 17:51
@alecxe thanks, that worked. However, I'm still not sure why it isn't going through the pages properly. It's still starting at Assassin's Creed and starting on page 2 of everything. Any ideas? I want it to start on the numbered page and progress from there – Ayohaych Mar 25 '14 at 19:05

@AndyOHart well, as I see it: you are still getting all of the data you need, from all pages. The order though is not guaranteed due to scrapy's asynchronous nature. See http://stackoverflow.com/questions/16875580/the-order-of-scrapy-crawling-urls-with-long-start-urls-list-and-urls-yiels-from?lq=1. – alecxe Mar 25 '14 at 19:15
You were right, I was getting everything, just not in order. But I'm not getting the numbered content on this page: http://www.metacritic.com/browse/games/title/ps3?view=condensed – Ayohaych Mar 25 '14 at 19:41
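
The thread does not answer this last comment. One possible explanation, offered as an assumption rather than something confirmed above: start_requests in the question only builds URLs that contain a letter, and once start_requests is overridden Scrapy no longer requests start_urls on its own, so the numbered listing is never fetched. The sketch below builds the URL list with the letterless listing included. The listing_urls helper is hypothetical, and the assumption that the numbered listing lives at .../ps3?page=N is inferred only from the start_urls entry in the question.

```python
from string import ascii_lowercase

# Base listing URL taken from the question's start_urls entry.
BASE = "http://www.metacritic.com/browse/games/title/ps3"


def listing_urls(max_id):
    """Return listing URLs for the numbered index plus each letter a-z.

    Assumption: the numbered titles are served from BASE itself (no letter
    segment), as the start_urls entry in the question suggests.
    """
    urls = []
    for prefix in [""] + list(ascii_lowercase):
        path = BASE if not prefix else "{0}/{1}".format(BASE, prefix)
        for page in range(max_id):
            urls.append("{0}?page={1}".format(path, page))
    return urls


urls = listing_urls(2)
print(urls[0])
print(urls[2])
```

Inside the spider, start_requests could then yield one Request per URL from such a list. The crawl order would still not be guaranteed, as noted in the comment above.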
