
I have a process (external to Scrapy) which generates a list of URLs to pdf documents, and a list of filepaths where I want to save each pdf.

The following explains how to pass a list of URLs to Scrapy as a command line argument; however, is there a way to also pass the filepaths and ensure each pdf is saved to its corresponding filepath?

I suspect I need to modify the code below, based on the tutorial in the documentation, but as I understand it the parse method determines how a single response is handled; it does not handle a list.

def parse(self, response):
    filename = response.url.split("/")[-2] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

Any suggestions?

  • Are you saving the first PDF to the first file path and so on, or do you have another scheme linking PDF to path? Maybe you could put up some pseudocode showing us the logic you want. – Steve Jan 25 '16 at 08:39

2 Answers


Turned out this was a Python-related question and nothing to do with Scrapy itself. The following is the solution I was after.

# To run:
# > scrapy runspider pdfGetter.py -a urlList=/path/to/file.txt -a pathList=/path/to/another/file.txt

import scrapy

class pdfGetter(scrapy.Spider):
    name = "pdfGetter"

    def __init__(self, urlList='', pathList='', *args, **kwargs):
        super(pdfGetter, self).__init__(*args, **kwargs)
        # One URL per line in the first file
        with open(urlList) as urlFile:
            self.start_urls = [url.strip() for url in urlFile.readlines()]
        # One save path per line in the second file, in the same order
        with open(pathList) as pathFile:
            self.save_urls = [path.strip() for path in pathFile.readlines()]

    def parse(self, response):
        # Save the pdf to the path at the same index as its URL
        idx = self.start_urls.index(response.url)
        with open(self.save_urls[idx], 'wb') as f:
            f.write(response.body)
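
One caveat with looking the URL up by index: if the server redirects, response.url no longer matches the entry in start_urls and .index() raises a ValueError. A minimal sketch of an alternative, under the same two-file setup, that pairs each URL with its save path up front and carries the path on the request via Request.meta (the 'save_path' key is just a name chosen for this sketch):

import scrapy

class pdfGetter(scrapy.Spider):
    name = "pdfGetter"

    def __init__(self, urlList='', pathList='', *args, **kwargs):
        super(pdfGetter, self).__init__(*args, **kwargs)
        with open(urlList) as urlFile:
            self.start_urls = [url.strip() for url in urlFile.readlines()]
        with open(pathList) as pathFile:
            self.save_urls = [path.strip() for path in pathFile.readlines()]

    def start_requests(self):
        # Pair each URL with its save path up front so the mapping
        # survives redirects and out-of-order responses
        for url, path in zip(self.start_urls, self.save_urls):
            yield scrapy.Request(url, meta={'save_path': path})

    def parse(self, response):
        # The save path travels with the request, so no index lookup is needed
        with open(response.meta['save_path'], 'wb') as f:
            f.write(response.body)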

If I am correct, you can't "crawl" a pdf with Scrapy. But if you just want to save pdfs, you don't need to crawl them; you only need the URL. So, for example, something like:

import urllib
from scrapy import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ['http://website-that-contains-pdf-urls']

    def parse(self, response):
        urls = response.xpath('//xpath/to/url/@href').extract()
        for url in urls:
            # Name each file after the last segment of its URL so the
            # downloads don't overwrite each other
            # (Python 2; in Python 3 this is urllib.request.urlretrieve)
            urllib.urlretrieve(url, filename=url.split('/')[-1])
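
That said, Scrapy itself can fetch the pdfs: a response for a pdf URL simply carries the raw bytes in response.body, so you can yield a Request per link and write the body out in a callback, keeping the downloads inside Scrapy's scheduler. A rough sketch, assuming the same placeholder XPath (save_pdf is a callback name invented for this sketch):

from scrapy import Spider, Request

class MySpider(Spider):
    name = "myspider"
    start_urls = ['http://website-that-contains-pdf-urls']

    def parse(self, response):
        for url in response.xpath('//xpath/to/url/@href').extract():
            # Resolve relative links and hand the pdf URL to Scrapy's downloader
            yield Request(response.urljoin(url), callback=self.save_pdf)

    def save_pdf(self, response):
        # response.body is the raw pdf; name the file after the URL's last segment
        with open(response.url.split('/')[-1], 'wb') as f:
            f.write(response.body)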