I have a list of HDFS zip file URLs and I want to open each file inside an RDD map function instead of using the binaryFiles function.
Initially, I tried something like the following:
def unzip(hdfs_url):
    # read the HDFS file using an HDFS Python client
    ...

rdd = spark.sparkContext.parallelize(list_of_hdfs_urls, 16)  # make 16 partitions
unzipped = rdd.map(unzip)  # map() is lazy and returns a new RDD, so keep the result
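For completeness, unzip would look roughly like this; a minimal sketch assuming the pyarrow HDFS bindings, with a placeholder namenode host and port:

import io
import zipfile
from pyarrow import fs

def unzip(hdfs_url):
    # Connect from the worker; "namenode" and 8020 are placeholder values.
    hdfs = fs.HadoopFileSystem("namenode", 8020)
    with hdfs.open_input_stream(hdfs_url) as f:
        data = f.read()  # pulls the whole zip over the network if the blocks are remote
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # return the decompressed contents of every member
        return [(name, zf.read(name)) for name in zf.namelist()]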
But later I realized that this wouldn't give data locality, even though it runs in parallel across the cluster.
Is there any way to run the map function for a file URL x on the node where the HDFS file x is located? How can I make Spark aware of this locality?
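I can at least find out where each file's blocks live from the driver through Spark's Java gateway; here is a sketch using the private _jvm and _jsc handles (not a stable public API) with a placeholder path:

# Inspect which datanodes hold a file's blocks, via the JVM gateway.
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()

path = jvm.org.apache.hadoop.fs.Path("hdfs:///data/file1.zip")  # placeholder path
fs = path.getFileSystem(hadoop_conf)
status = fs.getFileStatus(path)
block_locations = fs.getFileBlockLocations(status, 0, status.getLen())
hosts = {h for loc in block_locations for h in loc.getHosts()}
print(hosts)  # datanode hostnames storing this file's blocks

But parallelize does not seem to let me attach these hosts as preferred locations from Python.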
I want to read the zip files this way to get better performance in PySpark, since it would let me avoid serializing and deserializing the file contents between the Python and Java processes on each executor.
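For comparison, this is the binaryFiles approach I am trying to avoid, where the JVM reads each file and then serializes the raw bytes over to the Python worker:

import io
import zipfile

# Baseline: the JVM reads each file, then ships the raw bytes to Python.
pairs = spark.sparkContext.binaryFiles("hdfs:///data/*.zip")  # RDD of (path, bytes)

def unzip_bytes(path_and_content):
    path, content = path_and_content
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        return [(path, name, zf.read(name)) for name in zf.namelist()]

result = pairs.flatMap(unzip_bytes)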