
I am trying to feed a file of domains to Scrapy for processing, but I can't get the input to work in this format. Here is what I tried:

with open("url.txt","r") as f:
    DOMAIN = [u.strip() for u in f.readlines()]
    print DOMAIN
    URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = "emailextractor"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

The input file is in this format:

example.com
example.net
example.org
... etc

How can I give input to Scrapy in the format I am using? I want to prepend http:// to every URL I feed it. The file is also extremely large (several GB), so what is the best approach? Kindly, help me.
This question didn't work for me: Pass input file to scrapy containing a list of domains to be scraped

Jaffer Wilson

1 Answer


If you want to generate requests based on URLs from a file (or something else that you can't set directly in your start_urls list), you have to override scrapy.Spider's start_requests method in your own spider.

In this method you have to generate requests for the URLs you've read from the input file:

import scrapy

class MySpider(scrapy.Spider):
    name = "emailextractor"

    def start_requests(self):
        # Iterating over the file object reads one line at a time,
        # so even a multi-GB file is never loaded into memory at once.
        with open('urls.txt') as urls_file:
            for domain in urls_file:
                # The file contains bare domains, so prepend the scheme
                # that Scrapy requires to build a valid request.
                url = 'http://%s' % domain.strip()
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # parse the pages that your spider downloaded and extract the data
        pass
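If the input file might mix bare domains with full URLs, it is safer to add the scheme only when one is missing. A minimal sketch using the standard library (the helper name `normalize_url` is my own, not part of Scrapy):

```python
from urllib.parse import urlsplit

def normalize_url(line):
    """Turn a raw line from the input file into a fetchable URL.

    Adds the http:// scheme only when the line does not already
    have one, so bare domains and full URLs both work.
    """
    candidate = line.strip()
    if not urlsplit(candidate).scheme:
        candidate = 'http://' + candidate
    return candidate
```

You could then call this helper on each line inside `start_requests` before building the `scrapy.Request`.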
Valdir Stumm Junior