How can I optimize this for-with loop?

Question

I wrote a script to download PDF files from wikipedia. I implemented a loop to go over all the URLs I want to download (I have them in a .csv file). The first few files are getting downloaded pretty quick (no wonder, they only are like 200kB in size), but after a while, the downloads need longer and longer. It feels like some exponential growth in my loop, which makes me loop way slower after every iteration. Maybe the request isn't closed the right way or something, I really do not know.

Could someone please help me make this code less bad and more effective? urls and titles both are lists. They get passed from the same function, so I might just transform them into a dictionary.

    def getPDF(urls, titles, path):
    i = 0
    for i in range(len(urls) - 1):
        i += 1
        with open(path + '/{}.pdf'.format(titles[i]), 'wb') as f:
            with requests.session() as s:
                r = s.get(urls[i])
                f.write(r.content)
        print('{}.pdf'.format(titles[i]) + ' downloaded!')

EDIT: It must have something to do with the request. I added a function that prints out the time the download took (from the first line in getPDF() to the print() line. These are the results:

Downloads werden gestartet, das Programm beendet automatisch...

Wirtschaft.pdf downloaded! (2.606057643890381sec)
Wirtschaftseinheit.pdf downloaded! (1.41001296043396sec)
Planung.pdf downloaded! (1.6632893085479736sec)
Bedürfnis#In den Wirtschaftswissenschaften.pdf downloaded! (1.4947214126586914sec)
Unternehmen.pdf downloaded! (2.317748546600342sec)
Privathaushalt.pdf downloaded! (122.32739114761353sec)
%C3%96ffentlicher Haushalt.pdf downloaded! (2.03417706489563sec)
Absatzwirtschaft.pdf downloaded! (0.8923726081848145sec)
Produktion.pdf downloaded! (0.2800614833831787sec)
Tausch.pdf downloaded! (1.5359272956848145sec)
Konsum.pdf downloaded! (121.9487988948822sec)
Entsorgungswirtschaft.pdf downloaded! (121.20771074295044sec)
Gut (Wirtschaftswissenschaft).pdf downloaded! (245.15847492218018sec)

Fertig!

Note: I put this in code so it gets formatted, I hope that is alright.

It's pretty obvious that you get something like a 'strike' after 4 requests and then need to wait 2 minutes, at the end you got stroke immediately and even had to wait 4 minutes for the next requests. This would mean that the question is not related to "downloading big files", but more to "how to download lots of very small files?".

I guess the question now should be: Does anyone know how much delay you need to add in order to fix this? And do you agree with me that the 'lag' must be caused by sending too many requests in too little time?

Not the problem, but you shouldn't iterate over `range(len(something))`. Iterate over the thing itself; in this case you should zip together urls and titles. `for url, title in zip(urls, titles):` — Daniel Roseman, Sep 17 '18 at 15:09
Is this something you are after - https://stackoverflow.com/a/16696317/267482? — bobah, Sep 17 '18 at 15:10
Possible duplicate of [How to download large file in python with requests.py?](https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py) — bobah, Sep 17 '18 at 15:11
To make it look greater, you should make url, titles and path into an object, and pass as a parameter a list of that object, so your foor loop would iterate over this list, instead of using indexes. Not sure about the exponential growth tho, we would need some sample data to simulate the issue, but if i were to guess, i would say you are reopening the same file with the f, so the write, write always on the same file, with the write beeing slow instead of the get (unless you are 100% sure that the slow call is the get method). — Rodolfo Donã Hosp, Sep 17 '18 at 15:21

score 0 · Answer 1 · answered Sep 17 '18 at 15:33

The suggestions in comments are good if you encounter a large pdf file. You can get a faster result even with smaller files if your download asynchonously with scrapy

As for code aestetics you can put each individual responsibility to an individual function, eg:

def fetch(url):
    with requests.session() as s:
        r = s.get(url)
        return (r.content)

Your main worker function can be like:

def save_pdf(url, title, directory):
    filename = make_path(title, directory)
    with open(filename, 'wb') as f:
        content = fetch(url) 
        f.write(content)

And main loop control loop:

for t in targets:
   savePDF(t['url'], t['title'], directory)

Hope it helps.

How can I optimize this for-with loop?

1 Answers1