
Does anyone know how I could run the same Scrapy scraper over 200 times on different websites, each with its own output file? Usually in Scrapy you indicate the output file when you run it from the command line, by passing -o filename.json.
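
For reference, a single run with its own output file looks like this (the spider name is a placeholder):

    scrapy crawl myspider -o filename.json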

Matt

2 Answers


There are multiple ways to do this:

  • Create a pipeline that writes the items out, driven by a configurable parameter, e.g. running scrapy crawl myspider -a output_filename=output_file.txt. output_filename is then added as an attribute of the spider, and you can access it from a pipeline like this (a fuller sketch follows this list):

    class MyPipeline(object):
        def process_item(self, item, spider):
            # the -a argument is exposed as an attribute on the spider
            filename = spider.output_filename
            # now do your magic with filename
            return item

  • You can also run Scrapy from within a Python script and handle the per-site output yourself there (second sketch below).
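
A minimal sketch of the first option; the spider, the JSON-lines output format, and the start_urls value are illustrative assumptions, not part of the original answer:

    import json

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://example.com"]  # placeholder

        def __init__(self, output_filename=None, *args, **kwargs):
            # -a output_filename=... from the command line lands here
            super(MySpider, self).__init__(*args, **kwargs)
            self.output_filename = output_filename

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}


    class MyPipeline(object):
        def open_spider(self, spider):
            # one output file per run, named by the spider argument
            self.file = open(spider.output_filename, "w")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

With MyPipeline enabled in ITEM_PIPELINES, each invocation such as scrapy crawl myspider -a output_filename=site1.jl writes to its own file.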
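
And a sketch of the second option, scheduling every site from one script inside the Scrapy project; the FEEDS setting, the start_url/site_label spider arguments, and the website list are assumptions (it presumes your spider knows how to use start_url):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    websites = [  # hypothetical list; in practice read your ~200 sites from a file
        "http://example.com",
        "http://example.org",
    ]

    settings = get_project_settings()
    # feed URIs interpolate spider attributes, so every crawl gets its own file
    settings.set("FEEDS", {"output/%(site_label)s.json": {"format": "json"}})

    process = CrawlerProcess(settings)
    for i, url in enumerate(websites):
        process.crawl("myspider", start_url=url, site_label="site%03d" % i)
    process.start()  # starts the reactor; all scheduled crawls run and this blocks until they finish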

eLRuLL
  • I don't see how this answers the question of how to run this script multiple times simultaneously. – Maximilian Kohl Mar 03 '17 at 14:10
  • Because you can define the output file on the command line, you can call the same crawl command repeatedly, defining a different output file each time. – eLRuLL Mar 03 '17 at 14:50

I'm doing a similar thing. Here is what I have done:

  1. Write the crawler as you normally would, but make sure to implement feed exports. I have the feed export push the results directly to an S3 bucket. I also recommend accepting the website as a command line parameter to the script (a settings sketch follows this list).
  2. Set up scrapyd to run your spider.
  3. Package and deploy your spider to scrapyd using scrapyd-client.
  4. Now, with your list of websites, simply issue one curl command per URL to your scrapyd process (example below).
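
Step 1's feed export might be configured roughly like this in settings.py; the bucket path, the jsonlines format, and the credential placeholders are assumptions rather than the answerer's actual setup (S3 feed storage also needs botocore installed):

    # settings.py
    FEEDS = {
        # %(name)s and %(time)s are expanded per run, so each crawl gets its own key
        "s3://my-bucket/scrapes/%(name)s/%(time)s.jl": {"format": "jsonlines"},
    }
    AWS_ACCESS_KEY_ID = "my-access-key"        # placeholder
    AWS_SECRET_ACCESS_KEY = "my-secret-key"    # placeholder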
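
And step 4's per-URL call could look like the following; the host, project, spider, and start_url argument names are placeholders (scrapyd passes extra POST fields to the spider as arguments):

    curl http://localhost:6800/schedule.json \
         -d project=myproject \
         -d spider=myspider \
         -d start_url=http://example.com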

I've used the above strategy to shallow scrape two million domains, and I did it in less than 5 days.

Sam Texas
  • Thanks for the response. This will require me to type a line per URL? I am scraping 422 websites per day, indefinitely, so I would really like to be able to automate it. See my newer post for where I'm at: http://stackoverflow.com/questions/33663877/scrapy-crawlers-not-running-simultaneously-from-python-script – Matt Nov 12 '15 at 04:52
  • You should just make a text file with one line per url. Then make a script to iterate over the file, calling the scrapyd url. – Sam Texas Nov 12 '15 at 07:44
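
A sketch of such a script, assuming scrapyd is listening on the default localhost:6800 and the spider takes a start_url argument (both assumptions):

    import urllib.parse
    import urllib.request

    SCRAPYD = "http://localhost:6800/schedule.json"  # assumed default scrapyd address

    with open("urls.txt") as f:                      # one URL per line
        for line in f:
            url = line.strip()
            if not url:
                continue
            data = urllib.parse.urlencode({
                "project": "myproject",              # placeholder project name
                "spider": "myspider",                # placeholder spider name
                "start_url": url,                    # extra fields become spider arguments
            }).encode()
            with urllib.request.urlopen(SCRAPYD, data=data) as resp:
                print(url, resp.read().decode())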