
I have the following Python function to write the given content to a bucket in Cloud Storage:

import gzip
from google.cloud import storage

def upload_to_cloud_storage(json):
    """Write to Cloud Storage."""

    # The contents to upload as a JSON string.
    contents = json

    storage_client = storage.Client()

    # Path and name of the file to upload (file doesn't yet exist).
    destination = "path/to/name.json.gz"

    # Gzip the contents before uploading
    with gzip.open(destination, "wb") as f:
        f.write(contents.encode("utf-8"))

    # Bucket
    my_bucket = storage_client.bucket('my_bucket')

    # Blob (content)
    blob = my_bucket.blob(destination)
    blob.content_encoding = 'gzip'

    # Write to storage
    blob.upload_from_string(contents, content_type='application/json')

However, I receive an error when running the function:

FileNotFoundError: [Errno 2] No such file or directory: 'path/to/name.json.gz'

The traceback highlights this line as the cause:

with gzip.open(destination, "wb") as f:

I can confirm that the bucket and path both exist, although the file itself is new and yet to be written.

I can also confirm that removing the gzipping step results in the file being written to Cloud Storage successfully.

How can I gzip a new file and upload it to Cloud Storage?

Other answers I've used for reference: https://stackoverflow.com/a/54769937/127427 (creating a new gzip file).

ianyoung
    If you get a `FileNotFoundError` while opening the file for writing, you most likely specified the wrong path. Check the current working directory or use an absolute path and try again – etuardu Sep 09 '22 at 16:42
  • The file doesn't actually exist at that point (just the path). Following this example (https://stackoverflow.com/a/54769937/127427), I'm creating a new gzip file. Is this not the correct way to create a new gzip file? – ianyoung Sep 09 '22 at 16:47
  • I meant that if you are trying to write to `./path/to/name.json.gz` and you get `No such file or directory` it most likely means that the directory `./path` and/or the directory `./path/to` does not exist – etuardu Sep 12 '22 at 11:02
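
To illustrate etuardu's point: gzip.open() creates the file itself, but not its parent directories, so those must exist locally before writing. A minimal sketch of that workaround, using the destination path from the question and a placeholder JSON string:

import gzip
import os

destination = "path/to/name.json.gz"

# gzip.open() creates the file but not the intermediate directories,
# so create those first to avoid the FileNotFoundError.
os.makedirs(os.path.dirname(destination), exist_ok=True)

with gzip.open(destination, "wb") as f:
    f.write('{"key": "value"}'.encode("utf-8"))

Note this still writes a local file; the answers below avoid the local file entirely.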

2 Answers


Although @David's answer wasn't complete at the time I was solving my problem, it got me on the right track. Here's what I ended up using, along with explanations I found out along the way.

import gzip

from google.cloud import storage
from google.cloud.storage import fileio 

def upload_to_cloud_storage(json_string):
    """Gzip and write to Cloud Storage."""

    storage_client = storage.Client()
    bucket = storage_client.bucket('my_bucket')

    # Filename (include path)
    blob = bucket.blob('path/to/file.json')

    # Set blob metadata for decompressive transcoding
    blob.content_encoding = 'gzip'
    blob.content_type = 'application/json'

    writer = fileio.BlobWriter(blob)

    # Must write as bytes
    gz = gzip.GzipFile(fileobj=writer, mode="wb")

    # When writing as bytes we must encode our JSON string.
    gz.write(json_string.encode('utf-8'))

    # Close connections
    gz.close()
    writer.close()
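
For example (with a hypothetical payload), the function can then be called with any JSON string produced by json.dumps():

import json

payload = json.dumps({"status": "ok", "count": 3})
upload_to_cloud_storage(payload)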

We use the GzipFile() class instead of the gzip.compress() convenience function so that we can write incrementally to a file object and pass in the mode. GzipFile only deals in bytes: text modes such as wt are rejected outright, and writing a plain string in mode w raises the error:

TypeError: memoryview: a bytes-like object is required, not 'str'

So we must write in binary mode (wb), and the .write() method will then only accept bytes. This means we need to encode our JSON string, which can be done using str.encode() with utf-8. Failing to do this will result in the same error.
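
A quick way to see this behaviour in isolation, using an in-memory buffer in place of the BlobWriter:

import gzip
import io

buf = io.BytesIO()

with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    # gz.write('{"key": "value"}')  # a str raises the TypeError above
    gz.write('{"key": "value"}'.encode("utf-8"))  # bytes are accepted

# The buffer now holds gzip data, starting with the magic bytes 1f 8b.
print(buf.getvalue()[:2])  # b'\x1f\x8b'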

Finally, I wanted to enable decompressive transcoding, where the requester (a browser in my case) receives the uncompressed version of the file when requested. To enable this, google.cloud.storage.blob allows you to set metadata including content_type and content_encoding so we can follow best practices.

The result is that the JSON object in memory is written to your chosen destination in Cloud Storage in compressed form and decompressed on the fly when requested (without the requester needing to download a gzip archive).
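
As a sanity check, the object can be read back with the same client library (a sketch assuming the bucket and path used above); by default the library returns the decompressed data, so this prints the original JSON:

from google.cloud import storage

client = storage.Client()
blob = client.bucket('my_bucket').blob('path/to/file.json')

# download_as_bytes() expands gzip-encoded objects by default;
# pass raw_download=True to fetch the compressed bytes instead.
print(blob.download_as_bytes().decode('utf-8'))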

Thanks also to @JohnHanley for the troubleshooting advice.

ianyoung

The best solution is not to write the gzip to a local file at all, but to compress and stream directly to GCS.

import gzip

from google.cloud import storage
from google.cloud.storage import fileio

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('my_object')

# Stream the compressed bytes straight into the blob. GzipFile only
# writes bytes, so encode contents if it is a str.
writer = fileio.BlobWriter(blob)
gz = gzip.GzipFile(fileobj=writer, mode="wb")
gz.write(contents.encode('utf-8'))
gz.close()
writer.close()
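
(For context: BlobWriter performs a chunked resumable upload, so the compressed bytes stream to Cloud Storage without ever touching the local filesystem; no local path is involved, and hence no FileNotFoundError.)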
David
  • Thanks @David. Is this instead of `upload_from_string()`? Can I still set the content type as `application/json` and `Content-Encoding` as `gzip` for decompressive transcoding? (https://cloud.google.com/storage/docs/transcoding#content-type_vs_content-encoding) – ianyoung Sep 09 '22 at 17:07
  • @ianyoung - I recommend trying David's solution to your original question. If you have another problem or question, create a new question with your new code. – John Hanley Sep 09 '22 at 17:58
  • I'm getting `NameError: name 'BlobWriter' is not defined`. Which library does this come from? – ianyoung Sep 09 '22 at 18:07
  • I can now import `BlobWriter` but then get `unexpected keyword argument 'fileobject'`. If I remove `fileobject` as a named argument, I then get `TypeError: expected str, bytes or os.PathLike object, not BlobWriter`. I'm a little lost as to what I'm supposed to be doing. I appreciate the suggestion, but I could do with a little more info as to why it would be a better option. The only streaming example I can find in the docs is this one (https://cloud.google.com/storage/docs/samples/storage-stream-file-upload) but it is very different from the suggestion. Any ideas how to get it working? – ianyoung Sep 09 '22 at 18:17
  • @ianyoung - the first step is to review the documentation for each API that you use. That way you can look up and solve minor issues. Change `fileobject` to `fileobj`. See this link: https://docs.python.org/3/library/gzip.html#gzip.GzipFile – John Hanley Sep 09 '22 at 18:46
  • Thanks @JohnHanley. I feel I'm getting close. However, I'm now receiving another error, `OSError(errno.EBADF, "write() on read-only GzipFile object")`, on the `gzip` line. I have tried passing `mode=rb` as well as `rt`, `wb`, and `wt`, all to no avail. I'm not sure why it is a read-only object or how to change that other than via `mode`. – ianyoung Sep 09 '22 at 20:41
  • @ianyoung - create a new post with your current code. – John Hanley Sep 09 '22 at 22:09