Cluster terminates but works locally

Question

I am trying to deploy a Spark job (using the PySpark ML libraries) on AWS EMR. I want to create a simple cluster with a single instance, to understand how EMR works.

I create the cluster from the console with the following step configuration:

spark-submit --deploy-mode cluster s3://bucket/key/file.py

My step fails with a bunch of error logs that I struggle to understand, apart from this one:

  File "PowerProdPredictionEmr.py", line 261
df = df.select("Perimetre", *target_exprs, *window_exprs, "rn")

SyntaxError: invalid syntax

I don't understand this, since the code works locally on my machine.

Here is the code:

...
window_exprs = [df.power_prod[i] for i in range(w*sample_week)]
df = df.select("Perimetre", *target_exprs, *window_exprs, "rn")
...

Any ideas? I can add other log files if necessary.

You use Python 3 locally, but Python 2 remotely. In Python 2, positional arguments cannot follow varargs and you can only have one `*` expression. This would be portable: `df.select("Perimetre", "rn", *target_exprs + window_exprs)`, assuming `target_exprs` and `window_exprs` are lists. — user10938362, Mar 15 '19 at 18:12
Thanks a lot, that was the issue. EMR uses Python 2.x by default; do you know how to set it to 3.x when creating the cluster? I saw https://aws.amazon.com/fr/premiumsupport/knowledge-center/emr-pyspark-python-3x/ but I don't know how to run the shell command on all executors. — César Bouyssi, Mar 18 '19 at 12:34
The standard way (`PYSPARK_(DRIVER)_PYTHON`), placed in the EMR configuration, should work just fine: https://stackoverflow.com/a/40717927/10938362 — user10938362, Mar 18 '19 at 12:46

score 0 · Accepted Answer · answered Mar 19 '19 at 09:20

As @user10938362 pointed out, even though EMR supports Python up to version 3.6, version 2.x is the default one installed on the instances.

To make Python 3 the default version, add the following configuration under "Edit Software / Enter configuration" when creating the cluster.

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

With this setting, PySpark runs on Python 3 on the cluster, which resolves the Python version issue.
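If the job also has to run on a Python 2 interpreter, the call can be written so that it parses on both versions, as suggested in the comments. The following is only a minimal sketch: `select` is a hypothetical stand-in for `DataFrame.select` so that the snippet runs without Spark, and the list contents are placeholder values, not the real column expressions.

```python
import sys


def select(*cols):
    """Hypothetical stand-in for DataFrame.select: returns the arguments it received."""
    return list(cols)


# Placeholder values (assumptions); in the real job these are lists of column expressions.
target_exprs = ["t0", "t1"]
window_exprs = ["w0", "w1", "w2"]

# Python 3.5+ only: several * unpackings, followed by a positional argument.
# On Python 2 this line is a SyntaxError, which is what the EMR step reported.
# cols = select("Perimetre", *target_exprs, *window_exprs, "rn")

# Parses on Python 2 and 3: concatenate into one list, then unpack it once.
cols = select(*(["Perimetre"] + target_exprs + window_exprs + ["rn"]))

print(sys.version_info[0], cols)
```

Building a single list first keeps the original column order, whereas the one-line variant from the comments moves "rn" ahead of the expressions.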
