
I've successfully implemented pause/resume in Scrapy with help from the documentation (https://doc.scrapy.org/en/latest/topics/jobs.html). I can also scrape multiple pages to fill the fields of one item on a single CSV line, by adapting an example (How can I use multiple requests and pass items in between them in scrapy python). However, I can't get both to work together: a spider that scrapes two pages for each item and can also be paused and restarted.
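
To enable pause/resume I run the spider with a persistent job directory, as described in the docs:

scrapy crawl myScraper -o scrape_raw.csv -t csv -s JOBDIR=job_201117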

Here is my attempt, with www.beeradvocate.com as an example. urls_collection1 and urls_collection2 are lists of >40,000 URLs each.

Initiate

def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example url_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example url_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'

    for i in range(len(urls_collection1)):
        item = MyItem()  # MyItem stands in for my scrapy.Item subclass
        yield scrapy.Request(urls_collection1.iloc[i, 0], callback=self.parse1, meta={'item': item})
        yield scrapy.Request(urls_collection2.iloc[i, 0], callback=self.parse2, meta={'item': item})

        # To allow for pause/resume
        self.state['items_count'] = self.state.get('items_count', 0) + 1

Parse from first page

def parse1(self, response):
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()  
    yield item  

Parse from second page

def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item

Everything seems to work fine, except that the data scraped via parse1 and parse2 end up on different rows instead of on the same row as one item.

1 Answer


Try this:

def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example url_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example url_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'

    for i in range(len(urls_collection1)):
        item = MyItem()  # your scrapy.Item subclass
        yield scrapy.Request(urls_collection1.iloc[i, 0],
                             callback=self.parse1,
                             meta={'item': item,
                                   'collection2_url': urls_collection2.iloc[i, 0]})

def parse1(self, response):
    collection2_url = response.meta['collection2_url']
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    yield scrapy.Request(collection2_url,
                         callback=self.parse2,
                         meta={'item': item})

def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item
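
The key change is that parse1 no longer yields a half-filled item; it forwards the item along with the request for the second page, so the fully populated item is yielded only once, from parse2, and lands on a single CSV row.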
Wilfredo
  • Hi, thank you for your reply. I'm afraid this now yields an empty csv file for me. Perhaps it's the way I'm calling the spider? Here's the command: scrapy crawl myScraper -o scrape_raw.csv -t csv -s JOBDIR=job_201117 – Hanu Marna Nov 21 '17 at 15:24
  • do you know if the spider is filtering any request? does it reach the `parse2` method? – Wilfredo Nov 21 '17 at 20:46
  • I tried adding 'dont_filter=True' to the collection2_url request and it worked! Thank you so much, I didn't know about automatic request filtering and would have spent hours trying to figure out what went wrong! – Hanu Marna Nov 22 '17 at 10:53
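
For completeness, the change that made this work (per the last comment) was telling Scrapy not to run the follow-up request through its duplicate filter. A minimal sketch of the adjusted line in parse1, assuming the same names as above:

    yield scrapy.Request(collection2_url,
                         callback=self.parse2,
                         meta={'item': item},
                         dont_filter=True)  # skip the scheduler's duplicate filter so this request isn't dropped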