I'm trying to read S3 data from a Java Spark context:
"mapreduce.input.fileinputformat.input.dir.recursive", "true"
jsc.textFile(filePath);
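For context, a minimal self-contained sketch of this setup (the app name is hypothetical; the placeholder path is kept as above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3RecursiveRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3-recursive-read");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Tell FileInputFormat to descend into subdirectories of the input path.
        jsc.hadoopConfiguration()
           .set("mapreduce.input.fileinputformat.input.dir.recursive", "true");

        JavaRDD<String> lines = jsc.textFile("s3://<year>/<month>/<day>");
        System.out.println("lines: " + lines.count());
        jsc.stop();
    }
}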
It worked when I only had files inside the hour folders:
s3://<year>/<month>/<day>/<hour>/<files>
jsc.textFile("s3://<year>/<month>/<day>");
Now, in S3, parallel to the hour folders we may have a new_folder as well:
s3://<year>/<month>/<day>/<hour>/<files>
s3://<year>/<month>/<day>/<hour>/<new_folder>/<files>
The code below ignores the files under new_folder:
jsc.textFile("s3://<year>/<month>/<day>");
I tried multiple regular expressions, but my method isPathExists always returns false:
jsc.textFile("s3n://<year>/<month>/<day>/*/<regular_expression>");
I checked the S3 path using the method below, which returns false:
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;

private static boolean isPathExists(String folderPath, String bucket, String accessKey, String secret) {
    AWSCredentials cred = new BasicAWSCredentials(accessKey, secret);
    AmazonS3 s3 = new AmazonS3Client(cred);
    ObjectListing objectListing = s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(folderPath));
    return !objectListing.getObjectSummaries().isEmpty();
}
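One thing worth noting (my assumption about the failure, not confirmed against the exact paths tried): listObjects matches the prefix literally, so a folderPath containing '*' or other pattern characters will never match any real key, and the method returns false. With a literal prefix it behaves as expected:

// Hypothetical literal values; the prefix must contain no wildcards.
boolean exists = isPathExists("2016/01/15/", "my-bucket", accessKey, secret);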