
I have a list of HDFS zip file URLs, and I want to open each file inside an RDD map function instead of using the binaryFiles function.

Initially, I tried the following:

def unzip(hdfs_url):
    # read the hdfs file using an hdfs python client (body elided)
    ...

rdd = spark.sparkContext.parallelize(list_of_hdfs_urls, 16)  # make 16 partitions
unzipped = rdd.map(lambda a: unzip(a))  # lazy transformation; an action still has to trigger it
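
For concreteness, here is a rough sketch of what unzip might look like, assuming the hdfs (HdfsCLI) package, WebHDFS access, and that each URL is a plain HDFS path; the namenode host, port, and user are placeholders:

import io
import zipfile
from hdfs import InsecureClient  # HdfsCLI

def unzip(hdfs_url):
    # connect to the namenode over WebHDFS (host, port and user are placeholders)
    client = InsecureClient('http://namenode:9870', user='hadoop')
    # read the whole archive into memory, then open it with zipfile
    with client.read(hdfs_url) as reader:
        buf = io.BytesIO(reader.read())
    with zipfile.ZipFile(buf) as zf:
        # return the decoded content of every XML member in the archive
        return [zf.read(name).decode('utf-8') for name in zf.namelist()]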

But later I realized that this wouldn't give data locality, even though it does run in parallel across the cluster.

Is there any way to run the map function for a file URL x on the node where the HDFS file x is located? How can I make Spark aware of this locality?

I want to read the zip files in this manner to get better performance in PySpark, since it lets me avoid serializing and deserializing the file contents between the Python and Java processes on each executor.

gunturu mahesh
  • Hope this link may be useful: https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark/28636651#28636651 (a sketch of that approach appears after these comments). Also, do you use a container format with your zip compression, meaning Avro, SequenceFile, or Parquet? – VB_ Nov 06 '19 at 23:04
  • @VB_ no, the zip files contain XML files – gunturu mahesh Nov 06 '19 at 23:21
  • Pay attention to file splittability. gzip + xml means the data inside the files isn't splittable and must be uncompressed all at once. For example, if you have an HDFS block size of 128 MB and a file size of 1 GB, then the file will be split into ~8 blocks. But you can't parallelize processing of that file across 8 workers, because all 8 blocks have to be uncompressed all at once (by a single worker) – VB_ Nov 06 '19 at 23:42
  • In most cases, it's recommended to choose a splittable compression algorithm (e.g. LZO) or a non-splittable compression algorithm in conjunction with a container data format (e.g. Avro, SequenceFile, or Parquet). Any container data format is splittable. For example, a Parquet file compressed with Snappy will be splittable, so a 1 GB Parquet file could be read simultaneously by 8 workers – VB_ Nov 06 '19 at 23:44
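
For reference, here is a rough sketch of the approach from the answer linked in the comments, using binaryFiles plus Python's zipfile module. Each whole (non-splittable) archive is read by a single task, and Spark schedules those tasks with HDFS locality; the "hdfs:///data/*.zip" glob is a placeholder:

import io
import zipfile

def extract_xml(kv):
    # binaryFiles yields (path, whole-file content as bytes)
    path, content = kv
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            # yield one record per XML member of the archive
            yield path, name, zf.read(name).decode('utf-8')

xml_rdd = (spark.sparkContext
           .binaryFiles("hdfs:///data/*.zip")  # placeholder glob; tasks get HDFS locality
           .flatMap(extract_xml))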

0 Answers