
I'm trying to read S3 data from a Java Spark context:

"mapreduce.input.fileinputformat.input.dir.recursive", "true"
jsc.textFile(filePath);

It was working for me when I had only files inside the hour folders:

s3://<year>/<month>/<day>/<hour>/<files>
jsc.textFile("s3://<year>/<month>/<day>");

Now, in S3, parallel to the hour folders we may have a new_folder as well:

s3://<year>/<month>/<day>/<hour>/<files>
s3://<year>/<month>/<day>/<hour>/<new_folder>/<files>

The code below ignores the files under new_folder:

jsc.textFile("s3://<year>/<month>/<day>");

I tried multiple regular expressions, but my method isPathExists always returns false:

jsc.textFile("s3n://<year>/<month>/<day>/*/<regular_expression>");

I checked the S3 path using the method below, which returns false:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;

// Returns true if at least one object exists under the given prefix.
private static boolean isPathExists(String folderPath, String bucket, String accessKey, String secret) {
    AWSCredentials cred = new BasicAWSCredentials(accessKey, secret);
    AmazonS3 s3 = new AmazonS3Client(cred);
    ObjectListing objectListing = s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(folderPath));
    return !objectListing.getObjectSummaries().isEmpty();
}
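
Note that listObjects matches the prefix literally, so a * inside folderPath never matches any key, which would explain the constant false. One way to confirm the files exist is to list under the literal day prefix and apply the regex to each key client-side. A rough sketch, with the prefix and regex as placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

// Sketch: S3 prefixes have no wildcard support, so list under the literal
// day prefix (e.g. "<year>/<month>/<day>/") and filter keys with the regex
// on the client. listObjects returns at most 1000 keys per call; use
// listNextBatchOfObjects to page through larger results.
private static List<String> listMatchingKeys(AmazonS3 s3, String bucket, String dayPrefix, String keyRegex) {
    List<String> matched = new ArrayList<>();
    for (S3ObjectSummary summary : s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(dayPrefix))
            .getObjectSummaries()) {
        if (summary.getKey().matches(keyRegex)) {
            matched.add(summary.getKey());
        }
    }
    return matched;
}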
Kapil

1 Answer


If you want all subdirectories, then use two stars.

jsc.textFile("s3://<year>/<month>/<day>/**");

And for the files in those directories, add one more star (I think):

jsc.textFile("s3://<year>/<month>/<day>/**/*");
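
For completeness, combined with the recursive flag from the question, that might look like the following (untested sketch; the bucket name is a placeholder):

import org.apache.spark.api.java.JavaRDD;

// Untested sketch: enable recursive input dirs, then read with the glob.
// If the FileSystem's glob support doesn't handle "**", listing one level
// per "*" (e.g. ".../*/*") is a fallback.
jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true");
JavaRDD<String> lines = jsc.textFile("s3://<bucket>/<year>/<month>/<day>/**/*");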
OneCricketeer
  • This doesn't work for me. The only way I found is to list all the objects using the S3 API and then give Spark the list of files as input (see the sketch after these comments). – Erica Oct 10 '18 at 10:16
  • @nicola my answer was based on https://stackoverflow.com/questions/31782763/how-to-use-regex-to-include-exclude-some-input-files-in-sc-textfile – OneCricketeer Oct 10 '18 at 13:39
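
A rough sketch of the workaround Erica describes, i.e. collecting the keys with the S3 API and passing Spark an explicit comma-separated list of files (the bucket name and prefix are placeholders; s3 and jsc are the client and context from the question):

import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.spark.api.java.JavaRDD;
import java.util.ArrayList;
import java.util.List;

// Sketch: list every key under the day prefix, paging past the 1000-key
// limit with listNextBatchOfObjects, then hand Spark one comma-separated
// path string. This picks up the hour folders and new_folder alike.
List<String> paths = new ArrayList<>();
ObjectListing listing = s3.listObjects(
        new ListObjectsRequest().withBucketName("<bucket>").withPrefix("<year>/<month>/<day>/"));
while (true) {
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        paths.add("s3://<bucket>/" + summary.getKey());
    }
    if (!listing.isTruncated()) {
        break;
    }
    listing = s3.listNextBatchOfObjects(listing);
}
// textFile accepts a comma-separated list of paths.
JavaRDD<String> lines = jsc.textFile(String.join(",", paths));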