
I learned recently that you can use wget -r -P ./pdfs -A pdf http://example.com/ to recursively download PDF files from a website. However, this is not cross-platform, as Windows doesn't have wget. I want to use Python to achieve the same thing. The only solutions I've seen are non-recursive, e.g. https://stackoverflow.com/a/54618327/3042018

I would also like to be able to get just the names of the files without downloading them, so I can check whether a file has already been downloaded.

There are so many tools available in Python. What is a good solution here? Should I use one of the "mainstream" packages like Scrapy or Selenium, or maybe just requests? Which is the most suitable for this task, and how do I implement it?

Robin Andrews

1 Answer


There are several ways you can try; maybe one of them will work for you. Here are two examples.

If it's just a single download, you can use the following method.

from simplified_scrapy import req, utils
res = req.get("http://example.com/xxx.pdf")  # fetch the PDF
path = "./pdfs/xxx.pdf"
utils.saveResponseAsFile(res, path)  # write the response body to disk
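
If you would rather stay with the requests library mentioned in the question, a single download can be done roughly like this (a minimal sketch; the URL and output path are placeholders):

import os
import requests

url = "http://example.com/xxx.pdf"    # placeholder URL
path = "./pdfs/xxx.pdf"               # placeholder output path

os.makedirs("./pdfs", exist_ok=True)  # make sure the output folder exists
res = requests.get(url)
res.raise_for_status()                # stop if the request failed
with open(path, "wb") as f:
    f.write(res.content)              # write the PDF bytes to disk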

If you need to download the page first, and then extract the PDF links from it, you can use the following method.

import os, sys
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils
class MySpider(Spider):
    name = 'download_pdf'
    start_urls = ["http://example.com/"] # Entry page

    def __init__(self):
        Spider.__init__(self, self.name)  # necessary: initialize the base Spider
        if not os.path.exists('./pdfs'):
            os.mkdir('./pdfs')  # make sure the output folder exists

    def afterResponse(self, response, url, error=None, extra=None):
        try:
            # Build a local path from the file name in the URL
            path = './pdfs' + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0: path = path[:index]  # drop any query string
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # If it's not a PDF, hand the response back to the framework
                return Spider.afterResponse(self, response, url, error)
        except Exception as err:
            print(err)

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        # Collect all links whose href ends with .pdf
        lst = doc.selects('a').containsReg(r".*\.pdf", attr="href")
        for a in lst:
            a["url"] = utils.absoluteUrl(url.url, a["href"])  # make the links absolute
        return {"Urls": lst, "Data": None}


SimplifiedMain.startThread(MySpider())  # Start download
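
The code above does not cover the second part of the question (skipping files that have already been downloaded). One way to handle that, reusing the same path logic as afterResponse above, is to check for the file on disk before saving it. This is only a sketch of the idea, not part of the original answer:

    # Inside MySpider, a variant of afterResponse that skips existing files
    def afterResponse(self, response, url, error=None, extra=None):
        try:
            path = './pdfs' + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0: path = path[:index]
            if os.path.exists(path):  # file was downloaded earlier, skip it
                return None
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # not a PDF, hand the response back to the framework
                return Spider.afterResponse(self, response, url, error)
        except Exception as err:
            print(err)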
dabingsou