I am trying to crawl websites for data, and when I reach an 18+ page I get a warning page instead of the content. My crawler works on most reddit pages and retrieves the data successfully. I tried using Selenium to click through the warning; the browser it opens gets to the next page successfully, but the crawler doesn't follow to that page. Below is my code:
class DarknetmarketsSpider(scrapy.Spider):
    name = "darknetmarkets"
    allowed_domains = ["https://www.reddit.com"]
    start_urls = (
        'http://www.reddit.com/r/darknetmarkets',
    )
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=False),)

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('http://www.reddit.com/r/darknetmarkets')
        #self.driver.get('https://www.reddit.com/over18?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdarknetmarketsnoobs')
        while True:
            try:
                YES_BUTTON = '//button[@value="yes"]'
                next = self.driver.find_element_by_xpath(YES_BUTTON).click()
                url = 'http://www.reddit.com/r/darknetmarkets'
                next.click()
            except:
                break
        self.driver.close()
        item = darknetItem()
        item['url'] = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
            print link
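
For reference, the warning URL in the commented-out `driver.get()` line carries the page it guards as a percent-encoded `dest` query parameter. A minimal sketch of decoding it with the standard library follows; it is written for Python 3 (unlike the Python 2 spider above), and the helper name `warning_destination` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the commented-out driver.get() call in the spider.
WARNING_URL = (
    "https://www.reddit.com/over18"
    "?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Fdarknetmarketsnoobs"
)


def warning_destination(url):
    """Return the decoded `dest` target of a warning URL, or None if absent.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    query = parse_qs(urlsplit(url).query)  # parse_qs also percent-decodes values
    values = query.get("dest")
    return values[0] if values else None


print(warning_destination(WARNING_URL))
```

This only shows where the warning page intends to send the browser after the button is pressed; it does not get the crawler past the warning.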
The HTML of the button:
<button class="c-btn c-btn-primary" type="submit" name="over18" value="yes">continue</button>
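
Since the button is a plain form submit, clicking it in Selenium only changes the state of the browser session. One possible explanation, stated here as an assumption rather than a confirmed diagnosis, is that the `response` Scrapy passes to `parse` comes from Scrapy's own downloader, which never sees the browser's cookies. Under that assumption, a sketch of carrying the cookies over could look like this; the helper name `cookies_for_scrapy` and the sample cookie names and values are invented for illustration.

```python
def cookies_for_scrapy(selenium_cookies):
    """Collapse Selenium-style cookie dicts into a {name: value} mapping.

    Hypothetical helper: Selenium's driver.get_cookies() returns a list of
    dicts that include "name" and "value" keys, which is all this relies on.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Sample shaped like the output of driver.get_cookies(); the names and
# values below are made up, not taken from a real reddit session.
sample = [
    {"name": "over18", "value": "1", "domain": ".reddit.com", "path": "/"},
    {"name": "session", "value": "abc123", "domain": ".reddit.com", "path": "/"},
]

cookies = cookies_for_scrapy(sample)
print(cookies)

# Inside the spider, after the click, the mapping could be attached to a
# fresh request so Scrapy fetches the page with the browser's cookies:
#     yield scrapy.Request(url, cookies=cookies_for_scrapy(self.driver.get_cookies()))
```

Whether this is enough depends on how the site records the confirmation, which the code above does not verify.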