You can set up Spark with the Python and Scala shells on Windows, but the caveat is that, in my experience, performance on Windows is inferior to that of macOS and Linux. If you want to go the route of setting everything up on Windows, I made a short write-up of the instructions for this not too long ago that you can check out here. I am pasting the text below in case I ever move the file from that repo or the link breaks for some other reason.
Download and Extract Spark
Download the latest release of Spark from Apache.
Be aware that it is critical to get the right Hadoop binaries for the version of Spark you choose. See the section on Hadoop binaries below.
Extract with 7-Zip.
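If you prefer the command line over the 7-Zip GUI, something like the following works. The file name and paths here are only an example for a typical Spark 2.1.0 download, so adjust them to whatever you actually downloaded:
cd %USERPROFILE%\Downloads
:: the .tgz extracts to a .tar, which then extracts to the actual Spark folder
"C:\Program Files\7-Zip\7z.exe" x spark-2.1.0-bin-hadoop2.7.tgz
"C:\Program Files\7-Zip\7z.exe" x spark-2.1.0-bin-hadoop2.7.tar -oC:\spark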
Install Java and Python
Install the latest version of 64-bit Java.
Install Anaconda3 Python 3.5, 64-bit (or another version of your choice) for all users. Restart the server.
Test Java and Python
Open a command line and type java -version. If Java is installed properly you will see output like this:
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Type either python or python --version. The first will open the Python shell after showing the version information; the second will show only the version information, similar to this:
Python 3.5.2 :: Anaconda 4.2.0 (64-bit)
Download Hadoop binary for Windows 64-bit
You likely don't have Hadoop installed on Windows, but deep within its core Spark will look for Hadoop's Windows binaries (winutils.exe in particular). Thankfully a Hadoop contributor has compiled these and maintains a repository with binaries for Hadoop 2.6. Those binaries will work for Spark 2.0.2, but will not work with 2.1.0. To use Spark 2.1.0, download the binaries from here.
The best tactic for this is to clone the repo, keep the Hadoop folder corresponding to your version of Spark, and set the HADOOP_HOME environment variable to that hadoop-%version% folder.
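As a sketch, assuming you cloned the repo to C:\winutils and want the Hadoop 2.7 binaries (both the path and the folder name are assumptions, so use whatever matches your Spark build), you can set the variable permanently from the command line:
setx HADOOP_HOME "C:\winutils\hadoop-2.7.1"
:: Spark expects to find winutils.exe at %HADOOP_HOME%\bin\winutils.exe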
Add Java and Spark to Environment
Add the paths to Java and Spark as the environment variables JAVA_HOME and SPARK_HOME, respectively.
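For example, from the command line (the paths are assumptions based on the versions above, so point them at your actual install and extraction locations):
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_121
set SPARK_HOME=C:\spark\spark-2.1.0-bin-hadoop2.7
:: also put the bin folders on the PATH so pyspark and friends resolve from any directory
set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin
Keep in mind that set only lasts for the current command prompt; use setx or the System Properties dialog if you want the variables to persist.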
Test pyspark
In the command line, type pyspark and observe the output. At this point Spark should start in the Python shell.
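For an extra smoke test beyond the shell simply starting, Spark ships a run-example script in its bin folder. Assuming %SPARK_HOME%\bin is on your PATH, the classic Pi estimation example should run end to end:
run-example SparkPi 10
A rough estimate of Pi near the end of the fairly verbose output means the Hadoop binaries and environment variables are wired up correctly.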
Set up pyspark to use the Jupyter notebook
Instructions for using interactive Python shells with pyspark exist within the pyspark source code and can be viewed in your editor. To use the Jupyter notebook, type the following two commands before launching pyspark:
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
Once those variables are set, pyspark will launch in the Jupyter notebook with the default SparkContext initialized as sc and the SparkSession initialized as spark. ProTip: open http://127.0.0.1:4040 to view the Spark UI, which includes lots of useful information about your pipeline and completed processes. Any additional notebooks opened with Spark running will use consecutive ports, i.e. 4041, 4042, etc.
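If you later want the plain console shell back in the same command prompt, just clear the two variables (they only last for the session anyway):
set PYSPARK_DRIVER_PYTHON=
set PYSPARK_DRIVER_PYTHON_OPTS=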
The gist is that getting the right versions of the Hadoop binary files for your version of Spark is critical. The rest is making sure your path and environment variables are properly configured.