
I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.

I'm looking for something with the same final behaviour as the following commands:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally

split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_

gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/

But because I'm running these commands on a Compute Engine VM with limited disk space, downloading the huge file locally is not possible (and it would be a waste of network bandwidth anyway).

The file in this example is split by number of lines (-l 1000000), but I will also accept answers where the split is done by number of bytes.

I took a look at the docs about streaming uploads and downloads using gsutil, to do something like:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -l 1000000 | ...

But I can't figure out how to upload the split files directly to gs://$DST_BUCKET/ without creating them locally (temporarily creating only one shard at a time for the transfer is fine, though).

norbjd
  • Have you considered using Storage Transfer Service? That's exactly what it's meant for: https://cloud.google.com/storage-transfer/docs/overview#what_is – Maxim Jun 07 '19 at 10:40
  • I am aware that Storage Transfer Service could be used to transfer files from a bucket to another bucket, but here I want to split the file before copying it. I can't see any option in STS (https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec) to split input files. – norbjd Jun 07 '19 at 11:18
  • Understood. Is it necessary for you to manually do the split/compose? As you can just enable multi-threading for gsutil, or use parallel uploading: https://cloud.google.com/solutions/transferring-big-data-sets-to-gcp#transfer_from_colocation_or_on-premises_storage – Maxim Jun 07 '19 at 12:13
  • Unfortunately, files in the destination bucket must have a maximum number of lines (1,000,000), or at least a maximum size (10 MB): fulfilling either of these two conditions is OK for me. – norbjd Jun 07 '19 at 12:21

1 Answer


This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,

gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
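
For completeness, here is a minimal shell sketch of driving those range reads in a loop. The 10 MB chunk size, the object-size lookup via gsutil stat, and names like CHUNK and part_$PART are illustrative assumptions, not part of the answer above:

# total object size in bytes, parsed from gsutil stat output
SIZE=$(gsutil stat gs://$SRC_BUCKET/$MY_HUGE_FILE | awk '/Content-Length/ {print $2}')
CHUNK=$((10 * 1024 * 1024))   # 10 MB per piece; the ranges given to -r are inclusive
START=0
PART=0
while [ "$START" -lt "$SIZE" ]; do
  END=$((START + CHUNK - 1))
  # stream one byte range straight into a streaming upload, with no local file
  gsutil cat -r "$START-$END" gs://$SRC_BUCKET/$MY_HUGE_FILE \
    | gsutil cp - gs://$DST_BUCKET/part_$PART
  START=$((END + 1))
  PART=$((PART + 1))
done

Note that splitting by byte range like this can cut a line in half at each boundary, so it only covers the "maximum size" variant of the requirement, not the line-count one.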
Mike Schwartz
  • Interesting, thanks. So it seems like splitting a file by a specific number of lines is not possible; can you confirm this? – norbjd Jun 08 '19 at 08:25
  • You could write some code (e.g., on GCE or App Engine) to do what you're asking for. But given the constraint of using gsutil, no, it's not possible. – Mike Schwartz Jun 08 '19 at 16:40