
I am trying to crawl websites for data, and when I reach an 18+ page I get a warning interstitial. My crawler normally works on most reddit pages and retrieves the data successfully. I tried using Selenium to move past the warning; the browser it opens clicks through successfully, but the crawler doesn't follow to that page. Below is my code:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class DarknetmarketsSpider(scrapy.Spider):
    name = "darknetmarkets"
    allowed_domains = ["reddit.com"]  # domains only, no scheme
    start_urls = (
        'http://www.reddit.com/r/darknetmarkets',
    )
    # note: rules are only honoured by CrawlSpider, not scrapy.Spider
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=False),)

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('http://www.reddit.com/r/darknetmarkets')
        #self.driver.get('https://www.reddit.com/over18?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdarknetmarketsnoobs')

        try:
            # click the "continue" button on the over-18 warning page
            self.driver.find_element_by_xpath('//button[@value="yes"]').click()
        except NoSuchElementException:
            pass

        self.driver.close()

        # `response` here is still the page Scrapy fetched,
        # not the one Selenium navigated to
        item = darknetItem()
        item['url'] = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
            print(link)

The HTML of the button:

<button class="c-btn c-btn-primary" type="submit" name="over18" value="yes">continue</button>
Anekdotin
1 Answer


I see that you are trying to bypass the age-restriction screen in that subreddit. After you click the "continue" button, that choice is saved as a cookie, so you have to retrieve it and pass it to Scrapy.

After you click the button with Selenium, save the cookies and send them to Scrapy:

Code courtesy of scrapy authentication login with cookies

import scrapy
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = ['http://reddit.com/']

    def get_cookies(self):
        # click through the interstitial once, then hand the
        # session cookies over to Scrapy
        self.driver = webdriver.Firefox()
        base_url = "http://www.reddit.com/r/darknetmarkets/"
        self.driver.get(base_url)
        self.driver.find_element_by_xpath("//button[@value='yes']").click()
        cookies = self.driver.get_cookies()
        self.driver.close()
        return cookies

    def parse(self, response):
        yield scrapy.Request("http://www.reddit.com/r/darknetmarkets/",
            cookies=self.get_cookies(),
            callback=self.darkNetPage)
Rafael Almeida