
I'm trying to read S3 data from a Java Spark context:

"mapreduce.input.fileinputformat.input.dir.recursive", "true"
jsc.textFile(filePath);

It was working for me when I had only files inside the hour folders:

s3://<year>/<month>/<day>/<hour>/<files>
jsc.textFile("s3://<year>/<month>/<day>");

Now, in S3, parallel to the hour folders we may have a new_folder as well:

s3://<year>/<month>/<day>/<hour>/<files>
s3://<year>/<month>/<day>/<hour>/<new_folder>/<files>

The code below ignores the files under new_folder:

jsc.textFile("s3://<year>/<month>/<day>");

I tried multiple regular expressions, but my method isPathExists always returns false:

jsc.textFile("s3n://<year>/<month>/<day>/*/<regular_expression>");

I checked the S3 path using the method below, which returns false:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;

// Returns true if at least one object exists under the given prefix.
private static boolean isPathExists(String folderPath, String bucket, String accessKey, String secret) {
    AWSCredentials cred = new BasicAWSCredentials(accessKey, secret);
    AmazonS3 s3 = new AmazonS3Client(cred);
    ObjectListing objectListing = s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(folderPath));
    return !objectListing.getObjectSummaries().isEmpty();
}
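
Note that listObjects matches the prefix literally, so a * inside folderPath never matches any key, which would explain the constant false. One way to confirm the files exist is to list under the literal day prefix and apply the regex to each key client-side. A rough sketch, with the prefix and regex as placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

// Sketch: S3 prefixes have no wildcard support, so list under the literal
// day prefix (e.g. "<year>/<month>/<day>/") and filter keys with the regex
// on the client. listObjects returns at most 1000 keys per call; use
// listNextBatchOfObjects to page through larger results.
private static List<String> listMatchingKeys(AmazonS3 s3, String bucket, String dayPrefix, String keyRegex) {
    List<String> matched = new ArrayList<>();
    for (S3ObjectSummary summary : s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(dayPrefix))
            .getObjectSummaries()) {
        if (summary.getKey().matches(keyRegex)) {
            matched.add(summary.getKey());
        }
    }
    return matched;
}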
Kapil

1 Answer


If you want all subdirectories, then use two stars.

jsc.textFile("s3://<year>/<month>/<day>/**");

And for the files in those directories, add one more star (I think):

jsc.textFile("s3://<year>/<month>/<day>/**/*");
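
For completeness, combined with the recursive flag from the question, that might look like the following (untested sketch; the bucket name is a placeholder):

import org.apache.spark.api.java.JavaRDD;

// Untested sketch: enable recursive input dirs, then read with the glob.
// If the FileSystem's glob support doesn't handle "**", listing one level
// per "*" (e.g. ".../*/*") is a fallback.
jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true");
JavaRDD<String> lines = jsc.textFile("s3://<bucket>/<year>/<month>/<day>/**/*");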
OneCricketeer
  • This doesn't work for me. The only way I found is to list all the objects using the S3 API and then give Spark the list of files as input (see the sketch after these comments). – Erica Oct 10 '18 at 10:16
  • @nicola my answer was based on https://stackoverflow.com/questions/31782763/how-to-use-regex-to-include-exclude-some-input-files-in-sc-textfile – OneCricketeer Oct 10 '18 at 13:39
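
A rough sketch of the workaround Erica describes, i.e. collecting the keys with the S3 API and passing Spark an explicit comma-separated list of files (the bucket name and prefix are placeholders; s3 and jsc are the client and context from the question):

import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.spark.api.java.JavaRDD;
import java.util.ArrayList;
import java.util.List;

// Sketch: list every key under the day prefix, paging past the 1000-key
// limit with listNextBatchOfObjects, then hand Spark one comma-separated
// path string. This picks up the hour folders and new_folder alike.
List<String> paths = new ArrayList<>();
ObjectListing listing = s3.listObjects(
        new ListObjectsRequest().withBucketName("<bucket>").withPrefix("<year>/<month>/<day>/"));
while (true) {
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        paths.add("s3://<bucket>/" + summary.getKey());
    }
    if (!listing.isTruncated()) {
        break;
    }
    listing = s3.listNextBatchOfObjects(listing);
}
// textFile accepts a comma-separated list of paths.
JavaRDD<String> lines = jsc.textFile(String.join(",", paths));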