I wrote a script to download PDF files from Wikipedia. It loops over all the URLs I want to download (I have them in a .csv file). The first few files download quickly (no wonder, they are only about 200 kB each), but after a while the downloads take longer and longer. It feels like some exponential growth in my loop, making each iteration much slower than the last. Maybe the request isn't being closed properly; I really don't know.
Could someone please help me make this code less bad and more efficient?
urls and titles are both lists. They get passed from the same function, so I might just transform them into a dictionary.
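Since the two lists are parallel, that transformation would be a one-liner; here is a sketch with made-up sample data (the real titles and URLs come from my .csv file):

```python
# Hypothetical sample data standing in for the values read from the .csv file.
titles = ['Wirtschaft', 'Planung']
urls = ['https://de.wikipedia.org/api/rest_v1/page/pdf/Wirtschaft',
        'https://de.wikipedia.org/api/rest_v1/page/pdf/Planung']

# One dict mapping title -> URL instead of two parallel lists.
pages = dict(zip(titles, urls))
```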
import requests

def getPDF(urls, titles, path):
    # One session reused for every request, instead of a new one per file.
    with requests.Session() as s:
        # zip pairs each URL with its title; the old index arithmetic
        # (range(len(urls) - 1) plus i += 1) skipped the first URL.
        for url, title in zip(urls, titles):
            r = s.get(url)
            with open(path + '/{}.pdf'.format(title), 'wb') as f:
                f.write(r.content)
            print('{}.pdf downloaded!'.format(title))
EDIT:
It must have something to do with the request. I added a function that prints how long each download took (measured from the first line of getPDF() to the print() line). These are the results:
Starting downloads, the program will exit automatically...
Wirtschaft.pdf downloaded! (2.606057643890381sec)
Wirtschaftseinheit.pdf downloaded! (1.41001296043396sec)
Planung.pdf downloaded! (1.6632893085479736sec)
Bedürfnis#In den Wirtschaftswissenschaften.pdf downloaded! (1.4947214126586914sec)
Unternehmen.pdf downloaded! (2.317748546600342sec)
Privathaushalt.pdf downloaded! (122.32739114761353sec)
%C3%96ffentlicher Haushalt.pdf downloaded! (2.03417706489563sec)
Absatzwirtschaft.pdf downloaded! (0.8923726081848145sec)
Produktion.pdf downloaded! (0.2800614833831787sec)
Tausch.pdf downloaded! (1.5359272956848145sec)
Konsum.pdf downloaded! (121.9487988948822sec)
Entsorgungswirtschaft.pdf downloaded! (121.20771074295044sec)
Gut (Wirtschaftswissenschaft).pdf downloaded! (245.15847492218018sec)
Done!
Note: I put this in a code block so it keeps its formatting; I hope that is alright.
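For reference, the timing wrapper I used looks roughly like this (a sketch; the exact helper isn't shown above, and the names here are my own):

```python
import time

def timed_download(download_one, url, title):
    # Hypothetical wrapper: times a single download and prints the
    # duration in the same format as the log above.
    start = time.time()
    download_one(url, title)
    elapsed = time.time() - start
    print('{}.pdf downloaded! ({}sec)'.format(title, elapsed))
    return elapsed
```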
It's pretty obvious that you get something like a 'strike' after 4 requests and then have to wait 2 minutes; at the end the strike hit immediately, and I even had to wait 4 minutes for the next request. This would mean the question is not really about "downloading big files", but rather about "how to download lots of very small files?".
I guess the question now should be: does anyone know how much delay I need to add in order to avoid this? And do you agree that the 'lag' must be caused by sending too many requests in too short a time?
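For what it's worth, here is a sketch of the kind of throttling I have in mind. The delay value, the backoff_delay helper, and the retry count are all my own guesses, not anything Wikipedia documents; checking for HTTP 429 assumes the server actually signals rate limiting that way:

```python
import time
import requests

def backoff_delay(attempt, base=2.0):
    # Hypothetical exponential backoff: 2s, 4s, 8s, ... capped at 60s.
    return min(base * (2 ** attempt), 60.0)

def getPDF_throttled(urls, titles, path, delay=1.0, max_retries=3):
    # One session reused across all requests, so connections are pooled.
    with requests.Session() as s:
        for url, title in zip(urls, titles):
            for attempt in range(max_retries):
                r = s.get(url, timeout=30)
                if r.status_code == 429:  # rate-limited: back off, retry
                    time.sleep(backoff_delay(attempt))
                    continue
                r.raise_for_status()
                with open('{}/{}.pdf'.format(path, title), 'wb') as f:
                    f.write(r.content)
                print('{}.pdf downloaded!'.format(title))
                break
            time.sleep(delay)  # fixed pause between downloads
```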