I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(...)
        .withPrefix(...));
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey())); // repeat while objectListing.isTruncated()
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
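The "repeat while truncated" part that the comment refers to would look roughly like this (a sketch against the v1 AWS SDK using listNextBatchOfObjects; bucket and prefix are placeholders for my actual values):

ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName(bucket)   // placeholder for my real bucket name
        .withPrefix(prefix);      // placeholder for my real prefix
List<String> keys = new LinkedList<>();
ObjectListing listing = s3.listObjects(request);
while (true) {
    // collect the keys from the current page of results
    listing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey()));
    if (!listing.isTruncated()) {
        break;
    }
    // fetch the next page using the marker carried in the previous listing
    listing = s3.listNextBatchOfObjects(listing);
}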
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
    AmazonS3 s3Client = getAmazonS3Client(properties);
    S3Object object = s3Client.getObject(new GetObjectRequest(...));
    InputStream is = object.getObjectContent();
    List<String> lines = new LinkedList<>();
    String str;
    try {
        if (is != null) {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is));
            while ((str = reader.readLine()) != null) {
                lines.add(str);
            }
        } else {
            ...
        }
    } finally {
        ...
    }
    return lines.iterator();
}
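For context, the enclosing class is just a plain Spark FlatMapFunction, roughly like this (trimmed down; Properties is a stand-in for whatever type clusterProps actually has, and getAmazonS3Client(...) is my own helper):

import java.util.Iterator;
import java.util.Properties;
import org.apache.spark.api.java.function.FlatMapFunction;

public class ReadFromS3Function implements FlatMapFunction<String, String> {

    // FlatMapFunction is Serializable, so this field has to be too;
    // the AmazonS3 client is created inside call() for that reason.
    private final Properties properties;

    public ReadFromS3Function(Properties clusterProps) {
        this.properties = clusterProps;
    }

    @Override
    public Iterator<String> call(String s) throws Exception {
        // body as shown above: build a client from properties,
        // fetch the object for this key, and return its lines
        ...
    }
}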
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...)
, but I'm not sure which is the best-practice way.
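If that alternative works the way I think it does, it would be something like the following (a sketch; it relies on textFile accepting a comma-separated list of paths and on an s3a:// filesystem being configured on the cluster, and bucket is again a placeholder for my real bucket name):

// Build fully qualified S3 URIs from the keys collected above and pass
// them to Spark as a single comma-separated path string.
List<String> paths = new LinkedList<>();
keys.forEach(key -> paths.add("s3a://" + bucket + "/" + key));
JavaRDD<String> events = sc.textFile(String.join(",", paths));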