
I am currently using findspark to get a Spark context within my Jupyter notebook. To my understanding, this method only supports RDDs; it does not support Spark DataFrames or Spark SQL.
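For reference, here is a minimal sketch of what I am doing today, plus what I would like to be able to do (the SparkSession/DataFrame part is the goal, not code I have working; the app name and sample data are just placeholders):

```python
import findspark
# Point findspark at the Linuxbrew Spark install (path from my machine).
findspark.init('/home/ubuntu/.linuxbrew/Cellar/apache-spark/2.4.5/libexec')

from pyspark.sql import SparkSession

# What I would like to work inside the notebook: a session that supports
# DataFrames and Spark SQL, not just the low-level RDD API.
spark = (SparkSession.builder
         .master("local[*]")          # single-node, use all cores
         .appName("notebook-test")    # placeholder name
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()
```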

I have followed the instructions from the most-liked answer in this thread: How do I run pyspark with jupyter notebook?

but after changing the environment variables, pyspark fails to start, even in bash. Before making the changes from that post I made an AMI, and I have since rolled back to it; pyspark currently works in bash.

I noticed someone else commented suggesting Docker: https://hub.docker.com/r/jupyter/all-spark-notebook/

My system is currently running Ubuntu 18.04 on EC2. I installed Apache Spark with Linuxbrew. Jupyter and findspark are installed within a Conda environment.

The goal is to have a PySpark (or R Spark, or any Spark) kernel in Jupyter that can support all of the Apache Spark libraries. I would like to run Spark on one machine so I can develop and test code at low cost. I have used AWS Elastic MapReduce for a more scalable solution, and I intend to use that again after building the scripts on my single-node Spark machine (to keep cost low).

A few questions:

  1. Is my goal feasible, or is there a better way to obtain the same results? (e.g. just use AWS Elastic MapReduce with minimal hardware, or just stick to Vim and bash for pyspark)
  2. Would I be better off using Docker (https://hub.docker.com/r/jupyter/all-spark-notebook/), even though I have never used Docker? Would it be good for my future career?
  3. If Docker is the better choice, would I use Ubuntu 18.04 on EC2, or another Amazon service like ECS?
  4. Is there just a small step that I am missing to get the pyspark kernel working in my Jupyter notebook?

Some other info: SPARK_HOME is not set in my environment, so I had to pass the path to findspark, i.e. findspark.init('/home/ubuntu/.linuxbrew/Cellar/apache-spark/2.4.5/libexec')
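(Alternatively, I believe I could set SPARK_HOME before importing findspark instead of passing the path; a sketch of that, assuming the same Linuxbrew path:)

```python
import os

# Same Linuxbrew path as above; adjust for your install.
os.environ["SPARK_HOME"] = "/home/ubuntu/.linuxbrew/Cellar/apache-spark/2.4.5/libexec"

import findspark
findspark.init()  # with SPARK_HOME set, no argument is needed
```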

Thank you very much for your time; I hope the question is appropriate and detailed enough.

  • This might be a simple way to approach it: [how to integrate pyspark on jupyter notebook](https://stackoverflow.com/questions/39088189/how-to-integrate-pyspark-on-jupyter-notebook). Just look at what OP did and what the answer suggests. Might be just what you need. – ernest_k May 29 '20 at 06:05
  • I think your issue can be summarized by (1) _"I installed Apache Spark with linux-brew ... on Ubuntu"_ (2) you did not read the Spark documentation (3) _"pyspark fails to start"_. Try downloading the official Spark-with-Hadoop runtime, then play with the `pyspark` shell with different settings in `spark-env.sh` and `spark-defaults.conf`. Finally, find a tutorial about how to configure a Jupyter kernel, especially a PySpark kernel (you can override the default Spark conf there to get different flavors) – Samson Scharfrichter May 29 '20 at 06:22
  • @SamsonScharfrichter pyspark works in bash. My issue is that I can not get jupyter to use the pyspark kernel. The only time it failed was after following the instructions for configuring jupyter kernel in the thread I posted. Before I made those changes I made an AMI, so I just rolled it back. pyspark works fine in bash – daniel blanco May 29 '20 at 06:29
  • https://www.sicara.ai/blog/2017-05-02-get-started-pyspark-jupyter-notebook-3-minutes – Samson Scharfrichter May 29 '20 at 06:56
  • Clearly, what you need is a tutorial ==> use Google to find these; and they won't be on StackOverflow. – Samson Scharfrichter May 29 '20 at 07:02
  • PS: a "PySpark kernel" means that Jupyter shows a PySpark entry in the list of available kernels, based on a config file `[system or user dir that Jypyter scans]/arbitrary-name/kernel.json` where the JSON defines the kernel label, env variables, startup command. That's not what you are trying to do i.e. either launch manually Jupyter from a pyspark shell, or launch manually the Spark runtime from inside Jupyter in a Python kernel session. – Samson Scharfrichter May 29 '20 at 07:07

0 Answers