I am currently using findspark to get a Spark context inside my Jupyter notebook. To my understanding, this method only supports RDDs; it does not support Spark DataFrames or Spark SQL.
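For reference, this is roughly what I run in a notebook cell today (a minimal sketch of my current usage; the app name and the toy RDD are just placeholders):

```
import findspark

# SPARK_HOME is not set, so point findspark at the Linuxbrew Spark install (details below).
findspark.init('/home/ubuntu/.linuxbrew/Cellar/apache-spark/2.4.5/libexec')

from pyspark import SparkContext

# A plain SparkContext: this part works for me, but only gives me the RDD API.
sc = SparkContext(appName='notebook-test')
rdd = sc.parallelize(range(10))
print(rdd.sum())  # 45
```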
I followed the instructions in the most upvoted answer on this thread: How do I run pyspark with jupyter notebook?
but after changing the environment variables, pyspark fails to start, even in bash. Before changing the environment variables from that post I made an AMI, and I have since rolled back to it, so pyspark currently works in bash.
I also noticed someone in the comments suggesting Docker: https://hub.docker.com/r/jupyter/all-spark-notebook/
My current setup: Ubuntu 18.04 on EC2. Apache Spark was installed with Linuxbrew; Jupyter and findspark are installed in a Conda environment.
The goal is to have a pyspark (or rspark, or any Spark) kernel in Jupyter that supports all of the Apache Spark libraries. I would like to run Spark on a single machine so I can develop and test code at low cost. I have used AWS Elastic MapReduce for a more scalable solution, and I intend to go back to it after building the scripts on my single-node Spark machine (to keep costs down).
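To make the goal concrete, this is roughly the kind of cell I would like to be able to run in the notebook (just a sketch using the standard pyspark SparkSession API, assuming a kernel where pyspark imports cleanly; the data and names are placeholders):

```
from pyspark.sql import SparkSession

# A local, single-node session for cheap development and testing.
spark = (SparkSession.builder
         .master('local[*]')
         .appName('dev-notebook')
         .getOrCreate())

# DataFrames and Spark SQL are the pieces I can't use today.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])
df.createOrReplaceTempView('t')
spark.sql('SELECT COUNT(*) AS n FROM t').show()
```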
A few questions:
- Is my goal feasible, or is there a better way to get the same result? (e.g. just use AWS Elastic MapReduce with minimal hardware, or just stick to vim and bash for pyspark)
- Would I be better off using Docker (https://hub.docker.com/r/jupyter/all-spark-notebook/), even though I have never used Docker? Would it be good for my future career?
- If Docker is the better choice, should I run it on EC2 with Ubuntu 18.04, or on another Amazon service such as ECS?
- Is there just a small step that I am missing to get the pyspark kernel working in my Jupyter notebook?
Some other info: SPARK_HOME is not set in my environment, so I have to pass the install path to findspark explicitly, i.e. findspark.init('/home/ubuntu/.linuxbrew/Cellar/apache-spark/2.4.5/libexec'), as in the snippet above.
Thank you very much for your time. I hope the question is appropriate and detailed enough.