
I am running a PySpark job on AWS EMR and got the following error:

IOError: [Errno 20] Not a directory: '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg/my_common/data_tools/myData.yaml'

Does anyone know what I might have missed? Thanks!

Edamame

2 Answers


I've run into this recently when I switched my Python Spark application from Client deploy mode to Cluster deploy mode.

My workaround is to locate the ZIP file (the artifact that I fed to spark-submit using --py-files):

  import os

  CURRENT_FILE_PATH = os.path.dirname(__file__)
  print("[DEBUG] CURRENT_FILE_PATH=" + CURRENT_FILE_PATH)

It comes out something like this:

/mnt2/yarn/usercache/task/appcache/application_1638998214637_0019/container_1638998214637_0019_02_000001/something.zip

Then I can use something like:

  import json
  import zipfile

  # CURRENT_FILE_PATH points at the dependency ZIP itself, so open it
  # as an archive and read the resource out of it
  archive = zipfile.ZipFile(CURRENT_FILE_PATH, 'r')
  json_bytes = archive.read('myfile.json')
  json_string = json.loads(json_bytes)
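Since the .egg from the question is also just a ZIP archive, the same approach should carry over to the YAML resource there. A minimal sketch, with the paths taken from the error message above and assuming PyYAML is importable on the executors:

  import zipfile
  import yaml  # assumes PyYAML is available on the executors

  # An .egg is a ZIP archive too, so read the resource out of it directly
  egg_path = '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg'
  with zipfile.ZipFile(egg_path, 'r') as egg:
      data = yaml.safe_load(egg.read('my_common/data_tools/myData.yaml'))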

Note: I first tried using pkg_resources, but couldn't read the resulting JSON because json.loads() raised a TypeError:

  import pkg_resources

  # resource_stream() returns a file-like object, hence the TypeError from json.loads()
  json_data = pkg_resources.resource_stream(__name__, 'myfile.json')
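If pkg_resources is still preferred, a minimal sketch of a variant that may sidestep that TypeError (using the same placeholder myfile.json): resource_string() returns the raw bytes, which json.loads() does accept:

  import json
  import pkg_resources

  # resource_string() returns the resource contents as bytes,
  # which json.loads() can parse (unlike the stream object above)
  json_bytes = pkg_resources.resource_string(__name__, 'myfile.json')
  json_data = json.loads(json_bytes)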

See also PySpark: how to resolve path of a resource file present inside the dependency zip file

Aleksey Tsalolikhin

As the error states, my_common-9.0-py2.7.egg is not a directory.

Are you missing a space in your path?

/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg /my_common/data_tools/myData.yaml
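One quick way to confirm how the filecache entry was materialized on the executor (a diagnostic sketch only):

  import os

  egg = '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg'
  # If the egg was localized as a plain file (a ZIP), any path that goes
  # "through" it, like .../my_common-9.0-py2.7.egg/my_common/..., fails
  # with Errno 20 "Not a directory".
  print(os.path.isdir(egg), os.path.isfile(egg))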

Ronak Patel