I need to write data as individual JPG files (millions of them) from PySpark to an S3 bucket.
Setup: AWS EMR cluster and a Jupyter notebook.
I have tried several options:
- Create a boto3 client inside the 'foreach' function and write straight to S3 ==> too slow and inefficient, since a new client is opened for every record (a per-partition variant is sketched after the code below).
import boto3
import requests

def get_image(y):
    cid, img_url = y  # assuming each RDD element is a (cid, img_url) pair
    res = requests.get(img_url, stream=True)
    file_name = str(cid) + ".jpg"
    client = boto3.client('s3')  # a fresh client for every record -- the expensive part
    client.put_object(Body=res.content, Bucket='test', Key='out_images/' + file_name)

myRdd.foreach(get_image)
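For comparison, here is a rough sketch of the same upload with the client created once per partition via foreachPartition instead of once per record. The (cid, img_url) record layout, the 'test' bucket and the 'out_images/' prefix are just the assumptions from the snippet above, not verified code:

import boto3
import requests

def upload_partition(records):
    # one S3 client and one HTTP session per partition, reused for every record in it
    client = boto3.client('s3')
    session = requests.Session()
    for cid, img_url in records:
        res = session.get(img_url, stream=True)
        client.put_object(Body=res.content, Bucket='test',
                          Key='out_images/' + str(cid) + '.jpg')

myRdd.foreachPartition(upload_partition)

This at least amortises the client setup, although every image still goes through a single put_object call.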
- Write to the local file system and run an "aws s3 cp" to S3 afterwards ==> not clear how to collect the data, since it is written to each individual worker node's local volume. I logged into the worker nodes while the job was running but couldn't find where the JPGs were actually written (a variant with a fixed output directory is sketched after the code below).
import requests

def get_image(y):
    cid, img_url = y  # assuming each RDD element is a (cid, img_url) pair
    res = requests.get(img_url, stream=True)
    file_name = "./" + str(cid) + ".jpg"  # relative to the executor's working directory
    with open(file_name, 'wb') as f:
        f.write(res.content)

myRdd.foreach(get_image)
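If I stay with this approach, one variation would be to write to a fixed, known directory on every node instead of the executor's working directory (which under YARN is a temporary container directory that gets cleaned up), and then copy that directory from each node. The /mnt/output_images path and the (cid, img_url) record layout are assumptions, not something I have verified:

import os
import requests

OUTPUT_DIR = "/mnt/output_images"  # hypothetical fixed directory present on every worker node

def get_image(y):
    cid, img_url = y  # assuming each RDD element is a (cid, img_url) pair
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    res = requests.get(img_url, stream=True)
    with open(os.path.join(OUTPUT_DIR, str(cid) + ".jpg"), 'wb') as f:
        f.write(res.content)

myRdd.foreach(get_image)

followed by, on each worker node:

aws s3 cp /mnt/output_images s3://test/out_images/ --recursive

The per-node copy step is the awkward part, which is what pushes me towards the HDFS option below.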
- Write to HDFS and run an s3-dist-cp afterwards ==> probably the most efficient option, but I haven't got the code working yet; I get "path cannot be found" exceptions (a possible workaround is sketched after the code below).
import requests

def get_image(y):
    cid, img_url = y  # assuming each RDD element is a (cid, img_url) pair
    res = requests.get(img_url, stream=True)
    file_name = "hdfs://" + str(cid) + ".jpg"
    # the built-in open() only understands local paths, so the hdfs:// URI
    # is what triggers the "path cannot be found" exception
    with open(file_name, 'wb') as f:
        f.write(res.content)

myRdd.foreach(get_image)
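One workaround I am considering is to download each image to a local temp file and push it into HDFS with the hdfs CLI, since open() cannot write hdfs:// URIs directly. The /out_images HDFS directory and the (cid, img_url) record layout are assumptions, and I haven't tested this at scale:

import subprocess
import tempfile
import requests

def get_image(y):
    cid, img_url = y  # assuming each RDD element is a (cid, img_url) pair
    res = requests.get(img_url, stream=True)
    # write to a local temp file first, then move it into HDFS via the CLI
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        tmp.write(res.content)
        tmp.flush()
        subprocess.run(["hdfs", "dfs", "-put", "-f", tmp.name,
                        "/out_images/" + str(cid) + ".jpg"], check=True)

myRdd.foreach(get_image)

with the target directory created once up front (hdfs dfs -mkdir -p /out_images) and, at the end, something like s3-dist-cp --src /out_images --dest s3://test/out_images. Spawning an hdfs process per record is heavy, though, so this would presumably also need to be done per partition to be viable.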
Can someone suggest a good approach to achieving this?