I ran into this recently when I switched my Python Spark application from client deploy mode to cluster deploy mode.
My workaround is to locate the ZIP file (the artifact I fed to spark-submit via --py-files):
import os

# With --py-files, __file__ resolves to a path inside the ZIP, so dirname() yields the ZIP's own path
CURRENT_FILE_PATH = os.path.dirname(__file__)
print("[DEBUG] CURRENT_FILE_PATH=" + CURRENT_FILE_PATH)
It comes out something like this:
/mnt2/yarn/usercache/task/appcache/application_1638998214637_0019/container_1638998214637_0019_02_000001/something.zip
Then I can read the resource straight out of the archive:

import json
import zipfile

# Open the dependency ZIP and parse the bundled JSON file from it
archive = zipfile.ZipFile(CURRENT_FILE_PATH, 'r')
json_bytes = archive.read('myfile.json')
json_data = json.loads(json_bytes)
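
For reference, here is a minimal sketch that wraps both steps into one helper; the function name read_bundled_json is my own and not part of the original code:

import json
import os
import zipfile

def read_bundled_json(resource_name):
    # Hypothetical helper: derive the --py-files ZIP path from this
    # module's location, then parse the named JSON resource out of it
    zip_path = os.path.dirname(__file__)
    with zipfile.ZipFile(zip_path, 'r') as archive:
        return json.loads(archive.read(resource_name))

config = read_bundled_json('myfile.json')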
Note: I first tried pkg_resources, but json.loads() raised a TypeError, since resource_stream() returns a file-like object rather than the str/bytes that json.loads() expects:
import pkg_resources

# resource_stream() returns a binary file-like object, not str/bytes
json_stream = pkg_resources.resource_stream(__name__, 'myfile.json')
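
For completeness, json.load() (as opposed to json.loads()) accepts a file-like object directly, so a variant along these lines would likely have worked with pkg_resources as well; I haven't re-verified this in my setup:

import json
import pkg_resources

# json.load() consumes a file-like object directly,
# so the stream from resource_stream() can be parsed as-is
with pkg_resources.resource_stream(__name__, 'myfile.json') as json_stream:
    json_data = json.load(json_stream)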
See also: PySpark: how to resolve path of a resource file present inside the dependency zip file