
I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.

I'm looking for something with the same final behaviour as the following commands:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally

split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_

gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/

But because I'm running these commands on a Compute Engine VM with limited disk space, downloading the huge file locally is not possible (and it would be a waste of network bandwidth anyway).

The file in this example is split by number of lines (-l 1000000), but I will also accept answers where the split is done by number of bytes.

I took a look at the docs about streaming uploads and downloads using gsutil, to do something like:

gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -l 1000000 | ...

But I can't figure out how to upload the split files directly to gs://$DST_BUCKET/ without creating them locally (temporarily creating only one shard at a time for the transfer is fine, though).

norbjd
  • Have you considered using Storage Transfer Service? That's exactly what it's meant for: https://cloud.google.com/storage-transfer/docs/overview#what_is – Maxim Jun 07 '19 at 10:40
  • I am aware that Storage Transfer Service could be used to transfer files from a bucket to another bucket, but here I want to split the file before copying it. I can't see any option in STS (https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec) to split input files. – norbjd Jun 07 '19 at 11:18
  • Understood. Is it necessary for you to manually do the split/compose? As you can just enable multi-threading for gsutil, or use parallel uploading: https://cloud.google.com/solutions/transferring-big-data-sets-to-gcp#transfer_from_colocation_or_on-premises_storage – Maxim Jun 07 '19 at 12:13
  • Unfortunately, files in the destination bucket must have a maximum number of lines (1,000,000), or at least a maximum size (10 MB): fulfilling either of these two conditions is OK for me. – norbjd Jun 07 '19 at 12:21

1 Answer


This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,

gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
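
For completeness, here is a minimal shell sketch of driving those range reads in a loop. The 10 MB chunk size, the object-size lookup via gsutil stat, and names like CHUNK and part_$PART are illustrative assumptions, not part of the answer above:

# total object size in bytes, parsed from gsutil stat output
SIZE=$(gsutil stat gs://$SRC_BUCKET/$MY_HUGE_FILE | awk '/Content-Length/ {print $2}')
CHUNK=$((10 * 1024 * 1024))   # 10 MB per piece; the ranges given to -r are inclusive
START=0
PART=0
while [ "$START" -lt "$SIZE" ]; do
  END=$((START + CHUNK - 1))
  # stream one byte range straight into a streaming upload, with no local file
  gsutil cat -r "$START-$END" gs://$SRC_BUCKET/$MY_HUGE_FILE \
    | gsutil cp - gs://$DST_BUCKET/part_$PART
  START=$((END + 1))
  PART=$((PART + 1))
done

Note that splitting by byte range like this can cut a line in half at each boundary, so it only covers the "maximum size" variant of the requirement, not the line-count one.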
Mike Schwartz
  • Interesting, thanks. So it seems like splitting a file by a specific number of lines is not possible; can you confirm this? – norbjd Jun 08 '19 at 08:25
  • You could write some code (e.g., on GCE or App Engine) to do what you're asking for. But given the constraint of using gsutil, no, it's not possible. – Mike Schwartz Jun 08 '19 at 16:40