
I want to start by thanking you all ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip archive; however, the guide shows only how to compress a list of files as individual gzipped files. I appreciate any help, as there are very few resources and little documentation for this setup. (If there is some extra info, please include links to sources.)

After I had set up the grid engine, I ran through the samples in the guide.

Am I right in assuming there is no script in grid-computing-tools for combining multiple files into one gzip?

Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?

What changes can be made to the grid-engine-tools to make it work?

EDIT

The reason we are considering a cluster is that we expect multiple operations to occur simultaneously: files are zipped up per order, systematically, so that a vendor can download a single compressed file per order.

Howard Davis

3 Answers


Let me state the definition of the problem, and you can tell me whether I understood it correctly, since both Matt and I provided essentially the same solution and somehow it doesn't seem sufficient.

Problem Definition

  • You have an Order defining the start of a task to process some data.
  • The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
  • The goal is:

    1. Collect the files from GS bucket (that were produced by each of the nodes),
    2. Archive the collection of files as one file,
    3. Then compress that archive, and
    4. Push it back to a different GS location.
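If that summary is right, the four steps can be sketched as a single script. The order ID and bucket paths below are hypothetical, and the gsutil calls are behind a flag so the sketch runs as a local dry run without GCP credentials:

```shell
#!/bin/bash
# Sketch of: collect from GS -> archive -> compress -> push back to GS.
# ORDER_ID and the gs:// paths are made-up placeholders.
set -u

ORDER_ID="1234"
SRC="gs://example-bucket/orders/${ORDER_ID}/parts"    # per-node output files
DST="gs://example-bucket/orders/${ORDER_ID}/archive"  # final destination
WORK="/tmp/order-${ORDER_ID}-parts"
OUT="/tmp/order-${ORDER_ID}-out"
mkdir -p "${WORK}" "${OUT}"

# 1. Collect the files produced by each node from the GS bucket
if [ "${HAVE_GCP:-0}" = "1" ]; then       # set HAVE_GCP=1 when gsutil is configured
  gsutil -m cp "${SRC}/*" "${WORK}/"
else
  echo "stand-in data" > "${WORK}/part1.txt"   # local stand-in so the sketch runs
fi

# 2 + 3. Archive the collection and gzip it in one step (tar's z flag)
tar czf "${OUT}/order-${ORDER_ID}.tar.gz" -C "${WORK}" .

# 4. Push the single compressed file back to a different GS location
if [ "${HAVE_GCP:-0}" = "1" ]; then
  gsutil cp "${OUT}/order-${ORDER_ID}.tar.gz" "${DST}/"
fi
```

The archive is built in a directory separate from the input files so tar never tries to include its own output.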

Let me know if I summarized it properly.

Thanks, Paul

Paul Grosu
  • Well, I believe since we'd be using tar on several files, each process won't be split; instead each would be done on a single node. The part where we feel a cluster is necessary is that we'd have several different zip operations going on at a time: a single order requires a few files to be zipped, and there may be multiple orders being processed at any given time. – Howard Davis Jul 28 '16 at 17:54
  • Well, then the easiest is to make a subdirectory for each order, for organizing concurrent events. In any case, it's good to know how subdirectories really work in Google Cloud Storage, since the path is a key and the object is its value. Here's a link that details everything: https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork – Paul Grosu Jul 30 '16 at 18:24

So there are many ways to do it, but the catch is that you cannot directly compress a collection of files (or a directory) into one file on Google Storage; you would need to perform the tar/gzip combination locally before transferring it.

If you want you can have the data compressed automatically via:

gsutil cp -Z

Which is detailed at the following link:

https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories

And the nice thing is that you retrieve uncompressed results from compressed data on Google Storage, because it can perform decompressive transcoding:

https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
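The local equivalent of that round-trip can be sketched with plain gzip; with Cloud Storage, the compression (via cp -Z) and the decompression (via transcoding) happen on the service side. The bucket path in the comments is hypothetical:

```shell
# Local illustration of the round-trip that `gsutil cp -Z` plus decompressive
# transcoding give you: compressed at rest, decompressed on download.
cd /tmp
echo "order data" > results.csv
gzip -kf results.csv                      # produces results.csv.gz, keeps the original
gunzip -c results.csv.gz > roundtrip.csv
cmp -s results.csv roundtrip.csv && echo "round-trip intact"

# The Cloud Storage shape of the same round-trip (hypothetical bucket):
#   gsutil cp -Z results.csv gs://example-bucket/results.csv   # stored gzipped
#   gsutil cp gs://example-bucket/results.csv roundtrip.csv    # served decompressed
```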

Notice the last line of the following script:

https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh

That line copies the compressed files to Google Cloud Storage:

gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"

What you will need to do is first perform the tar/gzip on the files in the local scratch directory, and then gsutil copy the compressed file over to Google Storage. Make sure that all the files to be compressed are in the scratch directory before you start compressing them. Most likely you would need to scp them to one of the nodes (i.e. the master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
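That aggregate-on-the-master flow might look like the sketch below. The node names, paths, and bucket are hypothetical, and the scp/gsutil steps are behind a flag so the sketch runs as a dry run without a real cluster:

```shell
#!/bin/bash
# Aggregate per-node result files on the master, then compress once.
NODES="node1 node2 node3"            # hypothetical compute node names
SCRATCH="/tmp/order-5678-scratch"    # master's local scratch directory
mkdir -p "${SCRATCH}"

# Pull each node's results onto the master (set HAVE_CLUSTER=1 on a real cluster)
for node in ${NODES}; do
  if [ "${HAVE_CLUSTER:-0}" = "1" ]; then
    scp "${node}:/scratch/result-*.dat" "${SCRATCH}/"
  fi
done
echo "placeholder result" > "${SCRATCH}/result-0.dat"   # stand-in so tar has input

# tar/gzip the whole directory only after every file has arrived
tar czf /tmp/order-5678.tar.gz -C "${SCRATCH}" .

# Push the single compressed file to Google Storage (needs gsutil + credentials)
if [ "${HAVE_CLUSTER:-0}" = "1" ]; then
  gsutil cp /tmp/order-5678.tar.gz gs://example-bucket/orders/
fi
```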

Since Google Storage is fast at data transfers with Google Compute instances, the second and easiest option to pursue is to comment out lines 66-69 in the do_compress.sh file:

https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh

This way no compression happens, but the copy on the last line via gcs_util::upload still does, so all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them locally via tar/gzip, and copy the compressed archive back to the bucket with "gsutil cp".

Hope it helps but it's tricky, Paul

Paul Grosu
  • I appreciate the intent to help. We will likely need to compress on the cluster nodes and upload the resulting gzip to storage. The idea is so that simultaneous zips can occur at the same time. – Howard Davis Jul 26 '16 at 00:52
A couple of clarifying questions:

  • Are the files in question in Cloud Storage?
  • Are the files in question on a local or network drive?

In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this; it sounds more like you just want to use tar along with gzip.

The tar utility will create an archive file and can compress it as well. For example:

$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt

$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt

$ # (V)erify the archive
$ tar tvfz archive.tgz 
-rw-r--r--  0 myuser mygroup      14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r--  0 myuser mygroup      14 Jul 20 15:19 myfiles/file2.txt

To extract the contents use:

$ # E(x)tract the archive contents
$ tar xvfz archive.tgz 
x myfiles/file1.txt
x myfiles/file2.txt

UPDATE:

In your updated problem description, you have indicated that you may have multiple orders processing simultaneously. If the frequency at which results need to be tarred is low, and delivering the tarred results is not extremely time-sensitive, then you could likely do this with a single node.

However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.

Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.

A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.

There is an example that does something similar here:

https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress

This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.

Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.

-Matt

  • The files that are to be compressed together are in cloud storage, in different folders. After compressing, the single zip/gzip would be sent back into another cloud storage folder – Howard Davis Jul 21 '16 at 17:48
  • I've updated my question to be more specific. Ultimately we will want to use tar in the process, but I still feel that due to the many operations that are likely to be simultaneous, that a cluster is still necessary. The whole cluster operation is very new for me, and there's absolutely minimal information on elasticluster and grid-engine-tools. – Howard Davis Jul 26 '16 at 00:41