
I am trying to crawl websites for data, and when I reach an 18+ page I get a warning interstitial. My crawler normally works on most reddit pages and retrieves the data successfully. I tried using Selenium to move past the warning; the browser it opens clicks through successfully, but the crawler doesn't follow to that page. Below is my code:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class DarknetmarketsSpider(scrapy.Spider):
    name = "darknetmarkets"
    allowed_domains = ["reddit.com"]  # domains only, no scheme
    start_urls = (
        'http://www.reddit.com/r/darknetmarkets',
    )
    # note: rules are only honoured by CrawlSpider, not scrapy.Spider
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=False),)

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('http://www.reddit.com/r/darknetmarkets')
        #self.driver.get('https://www.reddit.com/over18?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdarknetmarketsnoobs')

        try:
            # click the "continue" button on the over-18 warning page
            self.driver.find_element_by_xpath('//button[@value="yes"]').click()
        except NoSuchElementException:
            pass

        self.driver.close()

        # `response` here is still the page Scrapy fetched,
        # not the one Selenium navigated to
        item = darknetItem()
        item['url'] = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
            print(link)

The HTML of the button:

<button class="c-btn c-btn-primary" type="submit" name="over18" value="yes">continue</button>
Anekdotin
1 Answer


I see that you are trying to bypass the age-restriction screen in that subreddit. After you click the "continue" button, that choice is saved as a cookie, so you have to retrieve it and pass it to Scrapy.

After you click the button with Selenium, save the cookies and send them to Scrapy:

Code courtesy of scrapy authentication login with cookies

import scrapy
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = ['http://reddit.com/']

    def get_cookies(self):
        # click through the interstitial once, then hand the
        # session cookies over to Scrapy
        self.driver = webdriver.Firefox()
        base_url = "http://www.reddit.com/r/darknetmarkets/"
        self.driver.get(base_url)
        self.driver.find_element_by_xpath("//button[@value='yes']").click()
        cookies = self.driver.get_cookies()
        self.driver.close()
        return cookies

    def parse(self, response):
        yield scrapy.Request("http://www.reddit.com/r/darknetmarkets/",
            cookies=self.get_cookies(),
            callback=self.darkNetPage)
Rafael Almeida