
I am using the AWS .NET SDK to run an s3-dist-cp job on EMR to concatenate all files in a folder using the --groupBy arg. But whatever --groupBy arg I have tried, it either failed or just copied the files without concatenating them, as if no --groupBy had been specified in the arg list.

The files in the folder were produced by Spark saveAsTextFile and are named like below:

part-0000
part-0001
part-0002
...
...

step.HadoopJarStep = new HadoopJarStepConfig
            {
                Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
                Args = new List<string>
                {
                    "--s3Endpoint=s3-eu-west-1.amazonaws.com",
                    "--src=s3://foo/spark/result/bar" ,
                    "--dest=s3://foo/spark/result-merged/bar",
                    "--groupBy=(part.*)",
                    "--targetSize=256"

                }
            };
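For completeness, here is a minimal sketch of how a step like the one above can be submitted to a running cluster with the AWS SDK for .NET (the step name, ActionOnFailure, region, and cluster id below are placeholders, not values from my actual setup; newer SDK versions may only expose the async AddJobFlowStepsAsync call):

using System.Collections.Generic;
using Amazon;
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

// Assumes "step" was declared earlier, e.g.: var step = new StepConfig();
// and its HadoopJarStep was configured as shown above.
step.Name = "S3DistCp: merge part files";          // placeholder name
step.ActionOnFailure = ActionOnFailure.CONTINUE;

using (var emr = new AmazonElasticMapReduceClient(RegionEndpoint.EUWest1))
{
    emr.AddJobFlowSteps(new AddJobFlowStepsRequest
    {
        JobFlowId = "j-XXXXXXXXXXXXX",             // placeholder cluster id
        Steps = new List<StepConfig> { step }
    });
}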
Barbaros Alp

1 Answer


After struggling with this the whole day, in the end I got it working with the groupBy arg below:

--groupBy=.*part.*(\w+)

But even if I add --targetSize=1024 to the args, s3-dist-cp produces 2.5 MB - 3 MB files. Does anyone have any idea why?

**UPDATE**

Here is the groupBy clause that concatenates all the files in each folder into a single file:

.*/(\\w+)/.*

The last "/" is so important here --source="s3://foo/spark/result/"

There are several folders inside the "result" folder:

s3://foo/spark/result/foo
s3://foo/spark/result/bar
s3://foo/spark/result/lorem
s3://foo/spark/result/ipsum

and each of the folders above contains hundreds of files like:

part-0000
part-0001
part-0002

The .*/(\\w+)/.* groupBy clause groups every file within each folder, so in the end you get one file per folder, named after the folder:

s3://foo/spark/result-merged/foo/foo -> File
s3://foo/spark/result-merged/bar/bar -> File
s3://foo/spark/result-merged/lorem/lorem -> File
s3://foo/spark/result-merged/ipsum/ipsum -> File
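To make the grouping a bit more concrete, here is a small standalone sketch (illustration only; it assumes s3-dist-cp matches the groupBy pattern against the full file path and that the double backslash in the command line is only needed for shell escaping, so the regex engine sees a single \w):

using System;
using System.Text.RegularExpressions;

// Illustration: the capture group of the groupBy pattern becomes the name
// of the merged output file, so every part file in one folder shares a key.
var groupBy = new Regex(@".*/(\w+)/.*");

string[] paths =
{
    "s3://foo/spark/result/foo/part-0000",
    "s3://foo/spark/result/bar/part-0001",
    "s3://foo/spark/result/lorem/part-0002"
};

foreach (var path in paths)
{
    var key = groupBy.Match(path).Groups[1].Value;
    Console.WriteLine($"{path} -> group key: {key}");
    // e.g. s3://foo/spark/result/bar/part-0001 -> group key: bar
}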

So, this is the final working command for me:

s3-dist-cp --src s3://foo/spark/result/  --dest s3://foo/spark/results-merged --groupBy '.*/(\\w+)/.*' --targetSize 1024
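Translated back into the .NET SDK args from the question, the same call would look roughly like this (a sketch only; note that in a regular C# string "\\w" produces a single backslash for the regex engine, so adjust the escaping if your argument handling differs):

step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--s3Endpoint=s3-eu-west-1.amazonaws.com",
        "--src=s3://foo/spark/result/",
        "--dest=s3://foo/spark/results-merged",
        "--groupBy=.*/(\\w+)/.*",
        "--targetSize=1024"
    }
};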

Thanks.

Barbaros Alp
  • I have a very similar problem to what you had, but my folder is a bit more nested. Can you please have a look at https://stackoverflow.com/questions/46833387/using-groupby-while-copying-from-hdfs-to-s3-to-merge-files-within-a-folder – Amistad Oct 19 '17 at 15:55