I have a mapPartitions call on an RDD, and within each partition a resource file has to be opened. The module that contains the method invoked by mapPartitions and the resource file are both passed to each executor as a zip file using the --py-files argument.
To make it clear:
    import json
    import os

    def work_doing_method(rows):
        for row in rows:
            # Resolve the resource relative to this module's location.
            resource_file_path = os.path.join(os.path.dirname(__file__), "resource.json")
            with open(resource_file_path) as f:
                resource = json.load(f)
            ...

    rdd = rdd.mapPartitions(work_doing_method)
When I run this after passing the zip file, which includes all of the above, to the spark-submit command using the --py-files parameter, I get:
IOError: [Errno 20] Not a directory:/full/path/to/the/file/within/zip/file
I do not understand how Spark uses the zip file to read the dependencies. The os.path.dirname utility returns the full path including the zip file, e.g. /spark/dir/my_dependency_file.zip/path/to/the/resource/file. I believe this is the problem, since open() cannot treat the zip file as a directory. I tried many combinations to resolve the path of the file. Any help is appreciated.
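To illustrate the situation outside Spark, here is a minimal sketch of what I think is happening, assuming the path derived from __file__ runs through the zip archive itself. The helper name load_resource and the pkg/resource.json layout are hypothetical and only for illustration; nothing here is Spark API. The helper walks up the path until it finds a real file, checks that it is a zip archive, and reads the rest of the path as a zip member with the standard zipfile module.

```python
import json
import os
import tempfile
import zipfile


def load_resource(path):
    """Load JSON from `path`, even if `path` points inside a zip archive.

    Hypothetical helper: walks up the path until it reaches a real file,
    then treats the remaining components as a member of that zip archive.
    """
    if os.path.isfile(path):
        # Ordinary file on disk: read it directly.
        with open(path) as f:
            return json.load(f)

    archive, member_parts = path, []
    while archive and not os.path.isfile(archive):
        archive, tail = os.path.split(archive)
        if not tail:
            break  # reached the filesystem root without finding a file
        member_parts.insert(0, tail)
    if not zipfile.is_zipfile(archive):
        raise IOError("no zip archive found on path: %s" % path)

    # Zip member names always use forward slashes.
    member = "/".join(member_parts)
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read(member).decode("utf-8"))


if __name__ == "__main__":
    # Build a throwaway archive that mimics the --py-files zip.
    tmp_dir = tempfile.mkdtemp()
    zip_path = os.path.join(tmp_dir, "my_dependency_file.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("pkg/resource.json", json.dumps({"key": "value"}))

    inside = os.path.join(zip_path, "pkg", "resource.json")
    try:
        open(inside)  # fails: the zip file is not a directory
    except (IOError, OSError) as exc:
        print("plain open() failed: %s" % exc)

    print(load_resource(inside))
```

With something like this, work_doing_method could call load_resource(resource_file_path) instead of open(), ideally once per partition rather than once per row. Whether the archive is still readable at that path on every executor is an assumption I have not verified.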
Thanks!