
With large files I get various errors that stop the download, so I want to resume from where it stopped by properly appending to the file on disk.

I saw that the FileIO has to be opened in 'ab' mode:

fh = io.FileIO(fname, mode='ab')

but I couldn't find how to specify where to continue from using MediaIoBaseDownload.

Any idea on how to implement this?

Joan Venge

2 Answers


When I saw your question, I thought that a related thread might be useful; I have posted my answer to that thread.

In order to achieve a partial download from Google Drive, a Range: bytes=500-999 header must be included in the request. Unfortunately, in the current stage, MediaIoBaseDownload cannot set this header; when MediaIoBaseDownload is used, all data is downloaded from the beginning.
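
As a minimal illustration of that header (assuming `creds` is an already-authorized credentials object and `file_id` is a placeholder you replace with a real file ID), a direct partial request to the files endpoint might look like this:

import requests

# Hypothetical values: replace with your own authorized credentials and file ID.
access_token = creds.token
file_id = "###"

# Ask the Drive API for bytes 500-999 of the file content only.
url = f"https://www.googleapis.com/drive/v3/files/{file_id}?alt=media"
headers = {
    "Authorization": f"Bearer {access_token}",
    "Range": "bytes=500-999",
}
res = requests.get(url, headers=headers)
print(res.status_code)   # typically 206 (Partial Content) when the range is honored
print(len(res.content))  # 500 bytes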

So, in order to achieve your goal, a workaround is required. In this workaround, I propose the following flow.

  1. Retrieve the filename and file size of the file on the Google Drive you want to download.
  2. Check the existing file by filename.
    • When there is no existing file, the file is downloaded as a new file.
    • When there is an existing file, the file is downloaded as a resumable download.
  3. Download the file content with the requests library.

When this flow is reflected in a Python sample script, it becomes as follows.

Sample script:

service = build("drive", "v3", credentials=creds) # Here, please use your client.
file_id = "###" # Please set the file ID of the file you want to download.

access_token = creds.token # The access token is retrieved from creds of service = build("drive", "v3", credentials=creds)

# Get the filename and file size.
obj = service.files().get(fileId=file_id, fields="name,size").execute()
filename = obj.get("name", "sampleName")
size = obj.get("size", None)
if not size:
    sys.exit("No file size.")
else:
    size = int(size)

# Check existing file.
file_path = os.path.join("./", filename) # Please set your path.
o = {}
if os.path.exists(file_path):
    o["start_byte"] = os.path.getsize(file_path)
    o["mode"] = "ab"
    o["download"] = "As resume"
else:
    o["start_byte"] = 0
    o["mode"] = "wb"
    o["download"] = "As a new file"
if o["start_byte"] == size:
    sys.exit("The download of this file has already been finished.")

# Download process
print(o["download"])
headers = {
    "Authorization": f"Bearer {access_token}",
    "Range": f'bytes={o["start_byte"]}-',
}
url = f"https://www.googleapis.com/drive/v3/files/{file_id}?alt=media"
with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(file_path, o["mode"]) as f:
        for chunk in r.iter_content(chunk_size=10240):
            f.write(chunk)
  • When this script is run, the file of file_id is downloaded. If the download is interrupted partway through, running the script again resumes it, and the remaining content is appended to the existing file. I thought that this might be your expected situation.

  • In this script, please load the following modules, in addition to the modules required for retrieving service = build("drive", "v3", credentials=creds).

    import os.path
    import requests
    import sys
    

Note:

  • In this case, it is supposed that the file to download is not a Google Docs file (Document, Spreadsheet, Slides, and so on). Please be careful about this.

  • This script supposes that your client service = build("drive", "v3", credentials=creds) can be used for downloading the file from Google Drive. Please be careful about this.
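
If you also want progress reporting (this is raised in the comments below), a minimal sketch is to count the bytes written inside the download loop, reusing the size, o, headers, url, and file_path values from the script above:

# Progress sketch: reuses `size`, `o`, `headers`, `url`, and `file_path` from the script above.
downloaded = o["start_byte"]
with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(file_path, o["mode"]) as f:
        for chunk in r.iter_content(chunk_size=10240):
            f.write(chunk)
            downloaded += len(chunk)
            print(f"Downloaded {downloaded / size * 100:.1f}%", end="\r")
print()
print("Done.")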


Tanaike
  • Thanks it seems to work. Now I have to figure out how to get the progress using this new method because with mediaIO I was doing this: status.progress(). Any ideas? – Joan Venge Feb 28 '23 at 07:08
  • @Joan Venge Thank you for replying. In order to use this script, the modules of `import os.path`, `import requests` and `import sys` are also required to be loaded. I added it to my answer. Please confirm it. – Tanaike Feb 28 '23 at 07:08
  • @Joan Venge If you want to confirm the download process, how about putting the confirmation script in the loop of `for chunk in r.iter_content(chunk_size=10240):`? In this case, I thought that these threads might be useful. https://stackoverflow.com/q/37573483 and https://stackoverflow.com/q/20801034 – Tanaike Feb 28 '23 at 07:18
  • Thanks a lot, yes the import works. I added some basic progress using a similar method; I will check your links. But the size 10240, is it 1MB? Because it was printing a lot of messages for each chunk, so I print every 100MB for example. Also, is 10240 an optional amount of data to use for requests? I assume it is, because MediaIO was using 100MB chunks; that's why I was wondering if using such big chunks could be an issue? – Joan Venge Feb 28 '23 at 07:32
  • Also, last question: do you know if comparing the file size on disk to the Google Drive file size is enough to see if the file has already been downloaded? Because either you don't have the file on disk or you have it; but if you have the file on disk, it could be either partially or fully downloaded, and that's what I want to detect. I am also checking, in another case, if the file is an old version using the modification date, and downloading it again. – Joan Venge Feb 28 '23 at 07:36
  • @Joan Venge Thank you for replying. About `10240`, I have tested my script using this value. In this case, I think that you can modify this value. For example, this thread might be useful. https://stackoverflow.com/q/46205586 – Tanaike Feb 28 '23 at 07:36
  • RE: checking file size, I am getting different values for the file on Google Drive and the file on disk even when fully downloaded, so I have to double-check it again. That's why I asked if it was reliable. – Joan Venge Feb 28 '23 at 07:40
  • @Joan Venge About `also last question`, in that case, for example, how about checking the value of md5Checksum? When this is reflected in my script, please modify `fields="name,size"` to `fields="name,size,md5Checksum"`. By this, you can retrieve it for the file on Google Drive as `md5Checksum = obj.get("md5Checksum")`. I think that this can be used for comparing the file on a local PC and Google Drive (see the sketch after these comments). As another hash, it seems that `sha1Checksum` and `sha256Checksum` can be used. – Tanaike Feb 28 '23 at 07:42
  • Thanks a lot man, I didn't know about this, I will try these today. Thanks a lot again man! – Joan Venge Feb 28 '23 at 07:44
  • @Joan Venge As additional information, I have created a CLI tool including the resumable download before. [Ref](https://github.com/tanaikech/goodls) In this CLI tool, I check the downloaded file using the value of MD5 hash. By this, the correct file can be checked. – Tanaike Feb 28 '23 at 07:46
  • Thanks a lot I will check your tool also. Btw I noticed after some time the download just "finished" but it didn't download the full file. The file size is 300GB and it managed to download like 12GB and then bail without an error message. I guess this is common? – Joan Venge Feb 28 '23 at 12:23
  • @Joan Venge About your situation of `Btw I noticed after some time the download just "finished" but it didn't download the full file. The file size is 300GB and it managed to download like 12GB and then bail without an error message. I guess this is common?`, in my current situation, I cannot test it. So, I cannot answer clearly. I apologize for this. In this case, how about posting it as a new question? I think that users who have the same situation might post an answer or comment. I apologize that I have no clear answer for this situation, again. – Tanaike Feb 28 '23 at 12:40
  • No problem, I will try to post it, it's the reproducible code bit that will be hard, because it happens with large files. – Joan Venge Feb 28 '23 at 17:11
  • Unfortunately there is another error I started getting: Daily Limit for Unauthenticated Use Exceeded. Continued use requires signup. But the thing is I already have credentials.json and I am using this for the creds.token, so not sure why I get this error as unauthorized. – Joan Venge Feb 28 '23 at 19:39
  • @Joan Venge About `Daily Limit for Unauthenticated Use Exceeded.`, are these threads useful? https://stackoverflow.com/search?q=%5Bgoogle-drive-api%5D+Daily+Limit+for+Unauthenticated+Use+Exceeded&s=54cfd887-195c-49ea-85f0-0c45c0d97a92 – Tanaike Feb 28 '23 at 22:55
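
As a sketch of the md5Checksum comparison suggested in the comments above (assuming the service, file_id, and file_path values from the script in this answer; md5Checksum is only reported for binary files, not Google Docs types), it might look like this:

import hashlib

# Retrieve Drive's reported MD5 alongside the name and size.
obj = service.files().get(fileId=file_id, fields="name,size,md5Checksum").execute()
drive_md5 = obj.get("md5Checksum")

# Hash the local file in chunks so large files do not need to fit in memory.
md5 = hashlib.md5()
with open(file_path, "rb") as f:
    for block in iter(lambda: f.read(1024 * 1024), b""):
        md5.update(block)

if drive_md5 and md5.hexdigest() == drive_md5:
    print("Local file matches the file on Google Drive.")
else:
    print("Local file is missing, incomplete, or different.")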

I cannot see your code, so I'll provide some general information on options that can help you solve the issue. You can implement downloading the file in chunks using MediaIoBaseDownload; you can see some documentation about this here.

Example:

  import io

  from googleapiclient.http import MediaIoBaseDownload

  # `farms` is a placeholder service object from the library documentation.
  request = farms.animals().get_media(id='cow')
  fh = io.FileIO('cow.png', mode='wb')
  downloader = MediaIoBaseDownload(fh, request, chunksize=1024*1024)

  done = False
  while done is False:
    status, done = downloader.next_chunk()
    if status:
      print("Download %d%%." % int(status.progress() * 100))
  print("Download Complete!")

The documentation for next_chunk() describes it as follows:

Get the next chunk of the download.

Args:
    num_retries: Integer, number of times to retry with randomized exponential backoff. If all retries fail, the raised HttpError represents the last request. If zero (default), we attempt the request only once.

Returns:
    (status, done): (MediaDownloadProgress, boolean) The value of 'done' will be True when the media has been fully downloaded or the total size of the media is unknown.

Raises:
    googleapiclient.errors.HttpError if the response was not a 2xx.
    httplib2.HttpLib2Error if a transport error has occurred.
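
On the question of telling MediaIoBaseDownload where to continue from: there is no public parameter for this. As a fragile sketch that relies on undocumented library internals (the private _progress attribute appears to hold the byte offset used to build the Range header of the next chunk request), resuming into an existing file might look like the following; this is an assumption about internals and may break between library versions:

import io
import os

# Fragile resume sketch relying on the private `_progress` attribute; not a supported API.
# Assumes `service`, `file_id`, and `file_path` are already defined.
request = service.files().get_media(fileId=file_id)
start = os.path.getsize(file_path) if os.path.exists(file_path) else 0
fh = io.FileIO(file_path, mode='ab' if start else 'wb')
downloader = MediaIoBaseDownload(fh, request, chunksize=1024 * 1024)
downloader._progress = start  # undocumented: offset of the next range request

done = False
while not done:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))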

I also found this example in the Google documentation here.

from __future__ import print_function

import io

import google.auth
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from googleapiclient.http import MediaIoBaseDownload


def download_file(real_file_id):
    """Downloads a file
    Args:
        real_file_id: ID of the file to download
    Returns : IO object with location.

    Load pre-authorized user credentials from the environment.
    TODO(developer) - See https://developers.google.com/identity
    for guides on implementing OAuth2 for the application.
    """
    creds, _ = google.auth.default()

    try:
        # create drive api client
        service = build('drive', 'v3', credentials=creds)

        file_id = real_file_id

        # pylint: disable=maybe-no-member
        request = service.files().get_media(fileId=file_id)
        file = io.BytesIO()
        downloader = MediaIoBaseDownload(file, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print(F'Download {int(status.progress() * 100)}.')

    except HttpError as error:
        print(F'An error occurred: {error}')
        file = None

    return file.getvalue()


if __name__ == '__main__':
    download_file(real_file_id='1KuPmvGq8yoYgbfW74OENMCB5H0n_2Jm9')

Lastly, you can review several examples on how to use MediaIoBaseDownload with chunks in these 2 blogs.

  1. Python googleapiclient.http.MediaIoBaseDownload() Examples
  2. googleapiclient.http.MediaIoBaseDownload

Update

Partial download functionality is provided by many client libraries via a Media Download service. You can refer to the client library documentation for details here and here. However, the documentation is not very clear.

The API client library for Java has more information and states that:

"The resumable media download protocol is similar to the resumable media upload protocol, which is described in the Google Drive API documentation."

In the Google Drive API documentation you will find some examples using Python for resumable upload. You can use the documentation of the Python google-resumable-media library, the Java resumable media download, and the resumable upload as a base for the code to restart the download once it fails.
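
As a sketch of what the google-resumable-media route might look like for a resumable download (assuming the creds, file_id, and file_path names used in the other answer; the start parameter is what lets the download begin at the size of an existing partial file):

import os

from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ChunkedDownload

# Authorized transport built from existing credentials.
transport = AuthorizedSession(creds)

url = f"https://www.googleapis.com/drive/v3/files/{file_id}?alt=media"
start = os.path.getsize(file_path) if os.path.exists(file_path) else 0

with open(file_path, "ab" if start else "wb") as stream:
    # `start` makes the library request bytes from that offset onward.
    download = ChunkedDownload(url, 1024 * 1024, stream, start=start)
    while not download.finished:
        download.consume_next_chunk(transport)
        print(f"{download.bytes_downloaded} / {download.total_bytes} bytes")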

Giselle Valladares
  • Thx, I have similar code to this already, but this doesn't show how to resume the download of a partially downloaded file on disk. So if the download is interrupted, I want to check if the file size on disk is different and continue downloading the rest of the file, appending to it. – Joan Venge Feb 27 '23 at 21:46
  • Sorry, my bad I forgot to add that part of the answer. I'm updating the information right now. – Giselle Valladares Feb 27 '23 at 22:53