I've written a script in python using scrapy to collect the name of different posts and their links from a website. When I execute my script from command line it works flawlessly. Now, my intention is to run the script using CrawlerProcess()
. I look for the similar problems in different places but nowhere I could find any direct solution or anything closer to that. However, when I try to run it as it is I get the following error:
from stackoverflow.items import StackoverflowItem ModuleNotFoundError: No module named 'stackoverflow'
This is my script so far (stackoverflowspider.py
):
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy
class stackoverflowspider(scrapy.Spider):
name = 'stackoverflow'
start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']
def parse(self,response):
sel = Selector(response)
items = []
for link in sel.xpath("//*[@class='question-hyperlink']"):
item = StackoverflowItem()
item['name'] = link.xpath('.//text()').extract_first()
item['url'] = link.xpath('.//@href').extract_first()
items.append(item)
return items
if __name__ == "__main__":
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(stackoverflowspider)
c.start()
items.py
includes:
import scrapy
class StackoverflowItem(scrapy.Item):
name = scrapy.Field()
url = scrapy.Field()
This is the tree: Click to see the hierarchy
I know I can bring up success this way but I am only interested to accomplish the task with the way I tried above:
def parse(self,response):
for link in sel.xpath("//*[@class='question-hyperlink']"):
name = link.xpath('.//text()').extract_first()
url = link.xpath('.//@href').extract_first()
yield {"Name":name,"Link":url}