
I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use my py3 environment (some installed Python packages) in the PySpark kernel, and I can't use the Spark context in the Python 3 kernel either.


I don't know how to make packages available through sparkmagic, so can I use the PySpark functionality that sparkmagic provides from the py3 kernel? Or are there other options?

  • I think you are mixing up where code actually runs with sparkmagic and Spark. In the PySpark kernel each cell is submitted automatically to the Spark cluster via the `livy` API. There is a `%%local` magic to run code on your machine, e.g. for visualizing or analyzing results. Remotely submitted code cannot use your local env; it cannot use packages installed on your machine. I wrote some hints in the answer on how to provide tensorflow to the cluster nodes – dre-hh Dec 23 '19 at 17:12
  • The difference between the 2 kernels provided by `sparkmagic` is that `PySpark` submits cells to a Spark cluster by default, while `IPython` submits only cells with the `%%spark` magic on the first line to the remote cluster. Check the examples https://github.com/jupyter-incubator/sparkmagic/tree/master/examples – dre-hh Dec 23 '19 at 17:16
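
A rough sketch of the two behaviours described above (the magics are sparkmagic's; the cell contents are placeholders, and it assumes the remote Livy session exposes the usual `sc`):

    # PySpark kernel: a plain cell like this is shipped to the remote Spark session via Livy
    sc.parallelize(range(10)).count()

    %%local
    # PySpark kernel: %%local keeps this cell in the notebook's own Python environment
    import pandas as pd
    print(pd.__version__)

    %%spark
    # IPython kernel: only cells whose first line is %%spark go to the remote cluster
    sc.parallelize(range(10)).count()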

1 Answer


Both kernels - PySpark and the default IPython - can be used with a Python 3 interpreter on PySpark. The interpreter can be specified in ~/.sparkmagic/config.json. This is standard Spark configuration and is simply passed by sparkmagic to the Livy server running on the Spark master node.

  "session_configs": {
    "conf": {
      "spark.pyspark.python":"python3"
     }
   }

`spark.pyspark.python`: Python binary executable to use for PySpark in both driver and executors.

`python3` in this case must be available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path, e.g. `"spark.pyspark.python": "/Users/hadoop/python3.8/bin/python"`.

All spark conf options can be passed like that.
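
A quick way to verify that the setting took effect is to run something like this in a cell that is submitted to the cluster (a plain PySpark-kernel cell or a `%%spark` cell); this is just a sketch and assumes the session exposes `sc`:

    import sys

    # Interpreter used on the driver (should be the one from spark.pyspark.python)
    print(sys.version)

    # Interpreter used on the executors
    print(sc.parallelize([0]).map(lambda _: __import__("sys").version).collect())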

There are 2 ways to import tensorflow:

  • install it on all Spark machines (master and workers) via `python3 -m pip install tensorflow`
  • zip it, upload it, and pass the remote path through sparkmagic via the `spark.submit.pyFiles` setting. It accepts a path on s3, hdfs, or the master node file system (not a path on your machine)

See answer about `--py-files`
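
Either way, a small smoke test from a remote cell shows whether the executors can actually import the package (a sketch, again assuming `sc` is the remote session's SparkContext):

    # Do the import inside a task so it runs on a worker, not just on the driver
    def tf_version(_):
        import tensorflow as tf
        return tf.__version__

    print(sc.parallelize(range(2), 2).map(tf_version).collect())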

dre-hh
  • That's a very helpful explanation! I may use the IPython kernel, which can store the dataframe in memory so I can use it with local tensorflow. If that works, my needs are met. – fresh learning Dec 24 '19 at 07:48
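
For that workflow, sparkmagic's `-o` option on the `%%spark` magic (shown in the examples repo linked above) pulls a remote dataframe back into the local notebook as a pandas DataFrame; a rough sketch, assuming the remote session exposes `spark` and using a hypothetical table name:

    %%spark -o local_df
    # Runs remotely; sparkmagic downloads local_df into the local notebook as pandas
    local_df = spark.sql("SELECT * FROM some_table LIMIT 1000")

    # In a following plain (local) cell of the IPython kernel:
    print(type(local_df), local_df.shape)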