
I have some large .avro files in Google Cloud Storage and I want to concatenate all of them into a single file.

So far I have found:

java -jar avro-tools.jar concat

However, since my files are in a Google Cloud Storage path (gs://files.avro), I can't concatenate them using avro-tools. Any suggestions on how to solve this?

Donnald Cucharo

2 Answers


You can use the gsutil compose command. For example:

gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite

Note: For extremely large files and/or very low per-machine bandwidth, you may want to split the file and upload it from multiple machines, and later compose these parts of the file manually.

In my case, I tested it with the following files: foo.txt contains the word Hello and bar.txt contains the word World. After running this command:

gsutil compose gs://bucket/foo.txt gs://bucket/bar.txt gs://bucket/baz.txt

baz.txt contains:

Hello
World
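
Since compose joins the source objects byte for byte in the order given, the same result can be reproduced locally with cat. This is only a local analogue of the example above, not a gsutil call; it assumes each file ends with a newline, and the temporary directory is just for illustration.

```shell
# Local analogue of the compose example, using cat instead of gsutil.
workdir=$(mktemp -d)
printf 'Hello\n' > "$workdir/foo.txt"
printf 'World\n' > "$workdir/bar.txt"

# Concatenate the sources in the order given, as compose does with objects.
cat "$workdir/foo.txt" "$workdir/bar.txt" > "$workdir/baz.txt"

# Show the combined file.
cat "$workdir/baz.txt"
```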

Note: GCS does not support inter-bucket composing.

If you encounter an exception related to integrity checks, run gsutil help crcmod for instructions on how to fix it.
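
For Avro specifically, compose joins the objects byte for byte, so the result would carry one header per source file and is unlikely to be readable as a single Avro file. One possible workaround, sketched here under assumptions, is to copy the parts to a local directory, merge them with avro-tools concat, and upload the result. The function name concat_avro_from_gcs, the bucket paths, and the default jar location are hypothetical; the sketch also assumes gsutil and java are on the PATH, the parts fit on local disk, and all parts share the same schema and codec.

```shell
# Hypothetical helper: download Avro parts, merge them locally, upload the result.
#   $1  source pattern, e.g. 'gs://my-bucket/parts/*.avro' (placeholder)
#   $2  destination object, e.g. gs://my-bucket/merged.avro (placeholder)
#   $3  path to the avro-tools jar (defaults to ./avro-tools.jar)
# The body runs in a subshell so that variables and "set -eu" stay local.
concat_avro_from_gcs() (
  set -eu
  src_pattern=$1
  dest=$2
  jar=${3:-avro-tools.jar}

  workdir=$(mktemp -d)
  mkdir -p "$workdir/parts"

  # gsutil expands the wildcard itself, so the pattern is passed quoted.
  gsutil -m cp "$src_pattern" "$workdir/parts/"

  # avro-tools concat takes the input files first and the output file last.
  java -jar "$jar" concat "$workdir"/parts/*.avro "$workdir/merged.avro"

  # Upload the merged file and clean up the local copies.
  gsutil cp "$workdir/merged.avro" "$dest"
  rm -rf "$workdir"
)

# Example call (placeholder paths):
#   concat_avro_from_gcs 'gs://my-bucket/parts/*.avro' gs://my-bucket/merged.avro
```

Running everything through a temporary directory keeps the original objects untouched until the merged file has been uploaded.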

Donnald Cucharo
  • It's a very nice option! However, is it useful for avro? (due to headers) – Marcus Sandri Oct 12 '20 at 01:37
  • There isn't a flag yet that lets you skip headers, so as an alternative you may have to write your own app to handle the concatenation. Other [users](https://stackoverflow.com/questions/57591243/how-to-use-gsutil-compose-in-googleshell-and-skip-first-rows) have done this for their CSV use case. – Donnald Cucharo Oct 12 '20 at 20:23

Check out https://github.com/spotify/gcs-tools

It is a lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-cli, proto-tools for Scio's Protobuf in Avro file, and magnolify-tools for Magnolify code generation, so that they can be used from regular workstations or laptops, outside of a Google Compute Engine (GCE) instance.

punkrockpolly