The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:
> All of Spark’s file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
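For context, here's a minimal, self-contained sketch of what that looks like end to end (the bucket and path are the hypothetical ones from the question, and the app name is made up):

```python
from pyspark import SparkConf, SparkContext

# Hypothetical app name; the bucket/path layout comes from the question.
conf = SparkConf().setAppName("load-gzipped-logs")
sc = SparkContext(conf=conf)

# One RDD backed by every file the glob matches; Spark/Hadoop transparently
# decompresses .gz files when reading them as text.
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")

print(rdd.count())  # total number of lines across all matched files
```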
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the `*` and `?` wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- The `*` matches all strings, so in this case all `.gz` files in all folders under `201412??` will be loaded.
- The `?` matches a single character, so `201412??` will cover all days in December 2014, like `20141201`, `20141202`, and so forth.
- The `,` lets you load separate files at once into the same RDD, like the `random-file.txt` in this case.
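As a side note, if you have many paths or patterns, you can build that comma-delimited argument programmatically rather than typing it out. A small sketch, assuming the same hypothetical bucket:

```python
from pyspark import SparkContext

sc = SparkContext(appName="multi-path-load")

# Hypothetical paths; ",".join() just produces the comma-delimited string
# that textFile accepts as a single argument.
paths = [
    "s3://bucket/201412??/*/*.gz",   # every .gz under every December 2014 day folder
    "s3://bucket/random-file.txt",   # plus one standalone file
]
rdd = sc.textFile(",".join(paths))

print(rdd.take(5))  # first few lines drawn from the combined RDD
```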
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on EMR, the correct URL scheme is `s3://`.
- If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, `s3a://` is the way to go. `s3n://` has been deprecated on the open-source side in favor of `s3a://`. You should only use `s3n://` if you're running Spark on Hadoop 2.6 or older.
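If you go the `s3a://` route and need to pass credentials explicitly (rather than relying on instance roles or environment variables), a minimal sketch looks like this. The `fs.s3a.*` keys are standard Hadoop S3A settings, `sc._jsc` is technically a private handle that's commonly used for this, and the credential values are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="s3a-read")

# Standard Hadoop S3A credential settings; replace the placeholder values.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# Same glob syntax as before, just with the s3a:// scheme.
rdd = sc.textFile("s3a://bucket/201412??/*/*.gz")
```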