
I am trying to feed a file of domains to Scrapy for processing, but I can't get the input to work in this format. Here is what I tried:

with open("url.txt","r") as f:
    DOMAIN = [u.strip() for u in f.readlines()]
    print DOMAIN
    URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = "emailextractor"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

The input file is in this format:

example.com
example.net
example.org
... etc

How can I give input to Scrapy in the format I am using? I want to prepend http:// to every URL I feed it. The file is also extremely large (several GB), so what is the best approach? Kindly, help me.
This question didn't work for me: Pass input file to scrapy containing a list of domains to be scraped

Jaffer Wilson

1 Answer


If you want to generate requests based on URLs from a file (or something else that you can't set directly in your start_urls list), you have to override scrapy.Spider's start_requests method in your own spider.

In this method you have to generate requests for the URLs you've read from the input file:

import scrapy

class MySpider(scrapy.Spider):
    name = "emailextractor"

    def start_requests(self):
        # Iterating over the file object reads one line at a time,
        # so even a multi-GB file is never loaded into memory at once.
        with open('urls.txt') as urls_file:
            for domain in urls_file:
                # The file contains bare domains, so prepend the scheme
                # that Scrapy requires to build a valid request.
                url = 'http://%s' % domain.strip()
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # parse the pages that your spider downloaded and extract the data
        pass
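If the input file might mix bare domains with full URLs, it is safer to add the scheme only when one is missing. A minimal sketch using the standard library (the helper name `normalize_url` is my own, not part of Scrapy):

```python
from urllib.parse import urlsplit

def normalize_url(line):
    """Turn a raw line from the input file into a fetchable URL.

    Adds the http:// scheme only when the line does not already
    have one, so bare domains and full URLs both work.
    """
    candidate = line.strip()
    if not urlsplit(candidate).scheme:
        candidate = 'http://' + candidate
    return candidate
```

You could then call this helper on each line inside `start_requests` before building the `scrapy.Request`.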
Valdir Stumm Junior