
I am running a PySpark job on AWS EMR and got the following error:

IOError: [Errno 20] Not a directory: '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg/my_common/data_tools/myData.yaml'

Does anyone know what I might have missed? Thanks!

Edamame

2 Answers


I've run into this recently when I switched my Python Spark application from Client deploy mode to Cluster deploy mode.

My workaround is to locate the ZIP file (the artifact that I fed to spark-submit using --py-files):

  import os

  CURRENT_FILE_PATH = os.path.dirname(__file__)
  print("[DEBUG] CURRENT_FILE_PATH=" + CURRENT_FILE_PATH)

It comes out something like this:

/mnt2/yarn/usercache/task/appcache/application_1638998214637_0019/container_1638998214637_0019_02_000001/something.zip

Then I can use something like:

  import json
  import zipfile

  # CURRENT_FILE_PATH points at the dependency ZIP itself, so open it
  # as an archive and read the resource out of it
  archive = zipfile.ZipFile(CURRENT_FILE_PATH, 'r')
  json_bytes = archive.read('myfile.json')
  json_string = json.loads(json_bytes)
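Since the .egg from the question is also just a ZIP archive, the same approach should carry over to the YAML resource there. A minimal sketch, with the paths taken from the error message above and assuming PyYAML is importable on the executors:

  import zipfile
  import yaml  # assumes PyYAML is available on the executors

  # An .egg is a ZIP archive too, so read the resource out of it directly
  egg_path = '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg'
  with zipfile.ZipFile(egg_path, 'r') as egg:
      data = yaml.safe_load(egg.read('my_common/data_tools/myData.yaml'))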

Note: I first tried using pkg_resources, but couldn't read the resulting JSON because json.loads() raised a TypeError:

  import pkg_resources

  # resource_stream() returns a file-like object, hence the TypeError from json.loads()
  json_data = pkg_resources.resource_stream(__name__, 'myfile.json')
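If pkg_resources is still preferred, a minimal sketch of a variant that may sidestep that TypeError (using the same placeholder myfile.json): resource_string() returns the raw bytes, which json.loads() does accept:

  import json
  import pkg_resources

  # resource_string() returns the resource contents as bytes,
  # which json.loads() can parse (unlike the stream object above)
  json_bytes = pkg_resources.resource_string(__name__, 'myfile.json')
  json_data = json.loads(json_bytes)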

See also PySpark: how to resolve path of a resource file present inside the dependency zip file

Aleksey Tsalolikhin

As the error states, my_common-9.0-py2.7.egg is not a directory.

Are you missing a space in your path?

/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg /my_common/data_tools/myData.yaml
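One quick way to confirm how the filecache entry was materialized on the executor (a diagnostic sketch only):

  import os

  egg = '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg'
  # If the egg was localized as a plain file (a ZIP), any path that goes
  # "through" it, like .../my_common-9.0-py2.7.egg/my_common/..., fails
  # with Errno 20 "Not a directory".
  print(os.path.isdir(egg), os.path.isfile(egg))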

Ronak Patel