
I need to download the monthly Open Library data dump files, which are quite large:

https://openlibrary.org/data/ol_dump_authors_latest.txt.gz

https://openlibrary.org/data/ol_dump_works_latest.txt.gz

https://openlibrary.org/data/ol_dump_editions_latest.txt.gz

The download hangs on the works and editions files because they are big files. The problem is that I don't get any exception saying the connection failed; it just stops downloading. I know that because the file size won't change for hours.

First Try

dump_url = "https://openlibrary.org/data/ol_dump_editions_latest.txt.gz"
dump_path =  "temp_file/ol_dump_editions_latest.txt.gz"
session = requests.Session()
    with session.get(dump_url, stream=True) as r:
        r.raise_for_status()
        with open(dump_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*1024): 
                f.write(chunk)

Second Try

dump_url = "https://openlibrary.org/data/ol_dump_editions_latest.txt.gz"
dump_path =  "temp_file/ol_dump_editions_latest.txt.gz"
session = requests.Session()
with session.get(dump_url, stream=True) as r:
    r.raise_for_status()
    with open(dump_path, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
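
To see the stall without waiting hours and comparing file sizes by hand, a rough progress-logging sketch like the following could be wrapped around either attempt (the 10-second report interval and variable names are just illustrative; the Content-Length header may be absent):

import time

import requests

dump_url = "https://openlibrary.org/data/ol_dump_editions_latest.txt.gz"
dump_path = "temp_file/ol_dump_editions_latest.txt.gz"

with requests.Session() as session:
    with session.get(dump_url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get("Content-Length", 0))  # 0 if the server omits it
        downloaded = 0
        last_report = time.monotonic()
        with open(dump_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
                downloaded += len(chunk)
                # Print progress roughly every 10 seconds; a stall shows up
                # as the output simply stopping.
                if time.monotonic() - last_report >= 10:
                    print(f"{downloaded}/{total or 'unknown'} bytes")
                    last_report = time.monotonic()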
  • Does this answer your question? [Download large file in python with requests](https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests) – Ivan Feb 01 '23 at 19:58
  • Well, there is nothing new there; I've tried everything they suggested in that question, I've seen it. I'm interested in why the code doesn't raise an error when it stops downloading. – Edin.A Feb 02 '23 at 18:02
  • I wrote the solution here: https://github.com/psf/requests/issues/6346 because Stack Overflow doesn't let me answer. – Edin.A Feb 04 '23 at 10:42
  • Do you have any reason to believe this is actually a bug in requests instead of a bug in the server that you're working around? That seems like a pretty big leap. (If it's not a bug in requests, why use their issue tracker?) – Charles Duffy Feb 05 '23 at 12:08
  • I opened the issue with the "I need help" option; it's not really an issue in requests, just a question. You are right, this is a server bug: the files are hosted on archive.org, which is very slow. – Edin.A Feb 05 '23 at 15:12
  • The solution you posted on that ticket is a good one, btw. What error does Stack Overflow give when it doesn't let you use that same answer here? I could post it for you and click the Community Wiki button so I'm not getting any credit, but that would still mean _you_ would get credit only for the question and not its answer; so it'd be better if we could figure out what's going wrong so you could post the answer here and take credit for it yourself. – Charles Duffy Feb 05 '23 at 15:52
  • You can post it for me, no problem, so if someone needs it they can use it. There is no error from Stack Overflow; when I started I didn't know the rules of Stack Overflow, and I'm banned from answering questions :P – Edin.A Feb 05 '23 at 15:56

1 Answer


This answer's core content was written by @Edin.A and is taken, with their permission, from a GitHub ticket they wrote. Formatting and prose have been lightly edited, but other than reducing log verbosity the code is essentially as they posted it.


This can be solved by passing requests a timeout= argument and then making a new, resumed request after the ConnectionError that the timeout causes. Note the max_d_r_c counter, capped at max_download_resumes, which prevents an endless retry loop:

import requests 
from requests.exceptions import ConnectionError
import os

def resume_download_ol_dump_editions_latest(dump_url, dump_path, max_d_r_c):
    # Give up after max_download_resumes attempts so a dead server cannot
    # keep us retrying forever.
    max_download_resumes = 30
    if max_d_r_c < max_download_resumes:
        max_d_r_c += 1
        # Open in append mode only to learn how many bytes are already on disk,
        # then ask the server to continue from that offset via a Range header.
        with open(dump_path, 'ab') as f:
            position = f.tell()
            pos_header = {"Range": f"bytes={position}-"}

        with requests.Session() as s:
            try:
                with s.get(dump_url, headers=pos_header, stream=True,
                           allow_redirects=True, timeout=300) as r:
                    r.raise_for_status()
                    with open(dump_path, 'ab') as f:
                        for chunk in r.iter_content(chunk_size=1024*1024):
                            f.write(chunk)
                            f.flush()
                            os.fsync(f.fileno())
            except ConnectionError:
                # Stalled again: resume from the new file position.
                resume_download_ol_dump_editions_latest(dump_url=dump_url, dump_path=dump_path, max_d_r_c=max_d_r_c)

def download_ol_dump_editions_latest(dump_url, dump_path):
    max_download_resumes_count = 0
    with requests.Session() as s:
        try:
            # timeout=300 makes the request fail instead of hanging forever
            # when the server stops sending data.
            with s.get(dump_url, stream=True, allow_redirects=True, timeout=300) as r:
                r.raise_for_status()
                with open(dump_path, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=1024*1024):
                        f.write(chunk)
                        f.flush()
                        os.fsync(f.fileno())
        except ConnectionError:
            # The initial download was interrupted; switch to resuming.
            resume_download_ol_dump_editions_latest(dump_url=dump_url, dump_path=dump_path, max_d_r_c=max_download_resumes_count)

dump_url = "https://openlibrary.org/data/ol_dump_editions_latest.txt.gz"
dump_path =  "temp_file/ol_dump_editions_latest.txt.gz"
download_ol_dump_editions_latest(dump_url=dump_url, dump_path=dump_path)
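
One assumption in the resume logic above is that the server honors HTTP Range requests; if it ignored the Range header, each retry would restart from byte 0 and append a duplicate copy to the file. A small, best-effort check using plain requests calls (the helper name is just illustrative) might look like this:

import requests

def server_supports_resume(url):
    """Best-effort check that the server advertises byte-range support."""
    with requests.Session() as s:
        r = s.head(url, allow_redirects=True, timeout=60)
        r.raise_for_status()
        # "Accept-Ranges: bytes" signals that Range requests are honored.
        return r.headers.get("Accept-Ranges", "").lower() == "bytes"

dump_url = "https://openlibrary.org/data/ol_dump_editions_latest.txt.gz"
print(server_supports_resume(dump_url))

A server can still honor ranges without advertising Accept-Ranges, so treat a negative result as "verify manually" rather than a hard no.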