94

I am trying to set up Apache Spark on Windows.

After searching a bit, I understand that standalone mode is what I want. Which binaries should I download in order to run Apache Spark on Windows? I see distributions with Hadoop and CDH on the Spark download page.

I can't find any references to this on the web. A step-by-step guide for this would be highly appreciated.

Mukesh Ram
Siva

10 Answers

143

Steps to install Spark in local mode:

  1. Install Java 7 or later. To verify that the Java installation is complete, open a command prompt, type java, and hit Enter. If you receive the message 'java' is not recognized as an internal or external command, you need to configure your JAVA_HOME and PATH environment variables to point to the path of the JDK.

  2. Download and install Scala.

    Set SCALA_HOME in Control Panel\System and Security\System → "Advanced system settings" → Environment Variables, and add %SCALA_HOME%\bin to the PATH variable.

  3. Install Python 2.6 or later from the Python download page.

  4. Download SBT. Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.

  5. Download winutils.exe from the HortonWorks repo or the git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Then set HADOOP_HOME = <<Hadoop home directory>> as an environment variable.

  6. We will be using a pre-built Spark package, so choose a Spark package pre-built for Hadoop from the Spark download page. Download and extract it.

    Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in Environment Variables (see the consolidated command-line sketch after this list).

  7. Run command: spark-shell

  8. Open http://localhost:4040/ in a browser to see the SparkContext web UI.
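
For reference, here is a minimal command-prompt sketch of the environment-variable part of these steps, using setx. The install locations below are hypothetical examples, so point them at wherever you actually installed or extracted each package; the Environment Variables dialog from step 2 works just as well.

REM Hypothetical locations - adjust to your own install/extract paths.
setx JAVA_HOME "C:\Java\jdk1.8.0_131"
setx SCALA_HOME "C:\scala"
setx SBT_HOME "C:\sbt"
REM HADOOP_HOME is the folder that contains bin\winutils.exe (step 5).
setx HADOOP_HOME "C:\hadoop"
REM SPARK_HOME is the extracted pre-built Spark package (step 6).
setx SPARK_HOME "C:\spark"
REM Append the bin folders to PATH. setx writes the user environment,
REM so open a NEW command prompt afterwards for the changes to take effect.
setx PATH "%PATH%;C:\Java\jdk1.8.0_131\bin;C:\scala\bin;C:\hadoop\bin;C:\spark\bin"

From a new command prompt, spark-shell should then drop you into the Scala REPL, and the web UI from step 8 should be reachable at http://localhost:4040/ while the shell is running.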

Ani Menon
  • 2
    I get "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'". Do I need an extra step for installing hive? – Stefan Mar 31 '17 at 19:52
  • 1
    @Stefan http://stackoverflow.com/questions/42264695/installing-spark-on-windows-10-spark-hive-hivesessionstate#42290821 – Ani Menon Apr 01 '17 at 16:01
  • 5
    That's very helpful, thanks. Also, if someone has error saying "failed to find spark jars directory" on running spark-shell, make sure there is no space in your SPARK_HOME path. Struggled on this for long. – Aakash Jain Apr 10 '17 at 07:09
  • 1
    This is gold right here. I can't explain how much trouble I had with Spark and Scala in Windows. I first tried Windows Ubuntu Bash. Not a good idea! Maybe if you have the latest creators update (Ubuntu 16), but otherwise there's a ton of errors and network issues. – Tom Aug 24 '17 at 22:48
  • do you necessarily need jdk? or is jre enough? – Triamus Nov 17 '17 at 14:08
  • @Triamus JRE is enough to run code. If you want to develop in java then JDK. – Ani Menon Nov 18 '17 at 14:47
  • Also doesn't seem to work with Java 9 currently - produces a bunch of warnings and "Failed to initialize compiler: object java.lang.Object in compiler mirror not found." Seems to work if you go back to Java 8 – quarkonium Mar 19 '18 at 06:41
  • @Mahesha999 You just need `winutils` to be in PATH for Spark to work on windows (as mentioned in step 5). You can either keep it in a different folder or have it in Spark folder and add that path to environment variable `HADOOP_HOME`. If you are facing issues, do post another question. – Ani Menon Mar 20 '18 at 11:04
  • @quarkonium Java 9 support is still a "work in progress" as of today. – Ani Menon Mar 20 '18 at 11:05
  • I was trying out pyspark on windows. [This](https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c) suggests to put winutils in SPARK_HOME\bin only. And make HADOOP_HOME point to same directory as SPARK_HOME. Is it ok? What you will suggest? – Mahesha999 Mar 20 '18 at 11:33
  • Can you explain the need/function of the winutils.exe file? Is it an official part of the Hadoop code? What does it do? etc – Stephen Aug 07 '18 at 18:57
  • Also, it seems like Spark's sbin folder only has scripts designed for use with *nix environments. Did I download the wrong code? – Stephen Aug 07 '18 at 20:53
  • @Stephen as long as your winutils and Java version are right, the Spark you downloaded should work. – Ani Menon Aug 18 '18 at 18:47
  • @Mahesha999 It's good practice to make the HOME variables point to the folder containing `bin` and other dirs and then add HOME/bin to the PATH. It would work even if you just place the bin directory path in PATH. – Ani Menon Aug 18 '18 at 18:49
  • These steps assume that one would use Scala as well as Python. Are they must? I have my programs in R, and ideally would like to use them. – Venkat Ramakrishnan Sep 07 '18 at 10:22
35

I found that the easiest solution on Windows is to build from source.

You can pretty much follow this guide: http://spark.apache.org/docs/latest/building-spark.html

Download and install Maven, and set MAVEN_OPTS to the value specified in the guide.
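
On Windows the build then boils down to something like the following from a command prompt in the extracted source directory. The MAVEN_OPTS value below is only an example of the kind of setting the guide recommends, so check the guide for the exact value for your Spark version:

REM Give Maven enough memory for the Spark build (example value; see the build guide).
set MAVEN_OPTS=-Xmx2g -XX:ReservedCodeCacheSize=512m
REM Build Spark, skipping the test suite.
mvn -DskipTests clean package

Once the build finishes, bin\spark-shell.cmd run from the same directory should give you a working shell.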

But if you're just playing around with Spark, and don't actually need it to run on Windows for any reason other than that your own machine runs Windows, I'd strongly suggest you install Spark on a Linux virtual machine. The simplest way to get started is probably to download the ready-made images from Cloudera or Hortonworks, and either use the bundled version of Spark, or install your own from source or from the compiled binaries you can get from the Spark website.

jkgeyti
  • 1
    Thank you for the heads up. Link is fixed. – jkgeyti Mar 26 '15 at 18:57
  • 1
    Hi, My Build on windows works fine with Cygwin but when I run the command ./start-master.sh in the sbin directory I get the error Error: Could not find or load main class org.apache.spark.launcher.Main full log in /cygdrive/c/Spark/spark-1.5.1/sbin/../logs/spark-auser-org.apache.spark.deploy.master.Master-1.host – Geek Oct 05 '15 at 14:10
  • Hi Yashpal, I tried that, but I got stuck on step 5 (winutils). I am unable to copy those files to my bin directory. – Venkat Ramakrishnan Sep 07 '18 at 10:20
21

You can download spark from here:

http://spark.apache.org/downloads.html

I recommend this version: Hadoop 2 (HDP2, CDH5)

Since version 1.0.0 there are .cmd scripts to run Spark on Windows.

Unpack it using 7zip or similar.

To start, you can execute /bin/spark-shell.cmd --master local[2]
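
For example, assuming you unpacked the download to C:\spark (a hypothetical path):

REM Change to the extracted Spark folder and start a local shell with 2 threads.
cd /D C:\spark
bin\spark-shell.cmd --master local[2]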

To configure your instance, you can follow this link: http://spark.apache.org/docs/latest/

ajnavarro
  • what hadoop alternative would you suggest? I mean something that we could also install on our windows PCs. Redis? – skan Feb 19 '16 at 19:23
17

You can use following ways to setup Spark:

  • Building from Source
  • Using prebuilt release

There are various ways to build Spark from source.
First I tried building the Spark source with SBT, but that requires Hadoop. To avoid those issues, I used a pre-built release.

Instead of the source, I downloaded the pre-built release for Hadoop 2.x and ran it. For this you need to install Scala as a prerequisite.

I have collated all the steps here:
How to run Apache Spark on Windows7 in standalone mode

Hope it helps!

Nishu Tayal
8

When trying to work with Spark 2.x.x, building the Spark source code didn't work for me.

  1. So, although I'm not going to use Hadoop, I downloaded the pre-built Spark with Hadoop embedded: spark-2.0.0-bin-hadoop2.7.tar.gz

  2. Point SPARK_HOME to the extracted directory, then add to PATH: ;%SPARK_HOME%\bin;

  3. Download the winutils executable from the Hortonworks repository, or from the Amazon AWS platform.

  4. Create a directory where you place the winutils.exe executable, for example C:\SparkDev\x64. Add the HADOOP_HOME environment variable pointing to this directory, then add %HADOOP_HOME%\bin to PATH.

  5. Using command line, create the directory:

    mkdir C:\tmp\hive
    
  6. Using the executable that you downloaded, grant full permissions on the directory you created, using the Unix-style path:

    %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
    
  7. Type the following command line:

    %SPARK_HOME%\bin\spark-shell
    

The Scala command-line prompt should appear automatically.

Remark: you don't need to configure Scala separately; it is bundled as well.
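
If spark-shell still complains about /tmp/hive permissions, you can double-check what the chmod in step 6 actually applied; as far as I know the Hortonworks winutils builds also include an ls subcommand:

%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive

The permissions column should read drwxrwxrwx; if it does not, re-run the chmod from step 6 (an administrator command prompt can help).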

Farah
3

Here are the fixes to get it to run on Windows without rebuilding everything, such as if you do not have a recent version of MS-VS. (You will need a Win32 C++ compiler, but you can install the MS VS Community Edition for free.)

I've tried this with Spark 1.2.2 and Mahout 0.10.2 as well as with the latest versions in November 2015. There are a number of problems, including the fact that the Scala code tries to run a bash script (mahout/bin/mahout), which of course does not work, the sbin scripts have not been ported to Windows, and winutils is missing if Hadoop is not installed.

(1) Install Scala, then unzip spark/hadoop/mahout into the root of C: under their respective product names.

(2) Rename \mahout\bin\mahout to mahout.sh.was (we will not need it)

(3) Compile the following Win32 C++ program and copy the executable to a file named C:\mahout\bin\mahout (that's right - no .exe suffix, like a Linux executable)

#include "stdafx.h"
#define BUFSIZE 4096
#define VARNAME TEXT("MAHOUT_CP")
int _tmain(int argc, _TCHAR* argv[]) {
    DWORD dwLength;     LPTSTR pszBuffer;
    pszBuffer = (LPTSTR)malloc(BUFSIZE*sizeof(TCHAR));
    dwLength = GetEnvironmentVariable(VARNAME, pszBuffer, BUFSIZE);
    if (dwLength > 0) { _tprintf(TEXT("%s\n"), pszBuffer); return 0; }
    return 1;
}

(4) Create the script \mahout\bin\mahout.bat and paste in the content below, although the exact names of the jars in the _CP class paths will depend on the versions of spark and mahout. Update any paths per your installation. Use 8.3 path names without spaces in them. Note that you cannot use wildcards/asterisks in the classpaths here.

set SCALA_HOME=C:\Progra~2\scala
set SPARK_HOME=C:\spark
set HADOOP_HOME=C:\hadoop
set MAHOUT_HOME=C:\mahout
set SPARK_SCALA_VERSION=2.10
set MASTER=local[2]
set MAHOUT_LOCAL=true
set path=%SCALA_HOME%\bin;%SPARK_HOME%\bin;%PATH%
cd /D %SPARK_HOME%
set SPARK_CP=%SPARK_HOME%\conf\;%SPARK_HOME%\lib\xxx.jar;...other jars...
set MAHOUT_CP=%MAHOUT_HOME%\lib\xxx.jar;...other jars...;%MAHOUT_HOME%\xxx.jar;...other jars...;%SPARK_CP%;%MAHOUT_HOME%\lib\spark\xxx.jar;%MAHOUT_HOME%\lib\hadoop\xxx.jar;%MAHOUT_HOME%\src\conf;%JAVA_HOME%\lib\tools.jar
start "master0" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip localhost --port 7077 --webui-port 8082 >>out-master0.log 2>>out-master0.err
start "worker1" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://localhost:7077 --webui-port 8083 >>out-worker1.log 2>>out-worker1.err
REM ...you may add more workers here...
cd /D %MAHOUT_HOME%
"%JAVA_HOME%\bin\java" -Xmx4g -classpath "%MAHOUT_CP%" "org.apache.mahout.sparkbindings.shell.Main"

The name of the variable MAHOUT_CP should not be changed, as it is referenced in the C++ code.

Of course you can comment-out the code that launches the Spark master and worker because Mahout will run Spark as-needed; I just put it in the batch job to show you how to launch it if you wanted to use Spark without Mahout.

(5) The following tutorial is a good place to begin:

https://mahout.apache.org/users/sparkbindings/play-with-shell.html

You can bring up the web UI of the Mahout Spark instance at:

"C:\Program Files (x86)\Google\Chrome\Application\chrome" --disable-web-security http://localhost:4040
Emul
2

The guide by Ani Menon (thx!) almost worked for me on Windows 10; I just had to get a newer winutils.exe from that Git repo (currently hadoop-2.8.1): https://github.com/steveloughran/winutils

Chris
1

Here are seven steps to install Spark on Windows 10 and run it from Python:

Step 1: Download the Spark 2.2.0 tar (tape archive) gz file to any folder F from this link: https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark.

Let the path to the spark folder be C:\Users\Desktop\A\spark

Step 2: Download the hadoop 2.7.3 tar gz file to the same folder F from this link: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder from Hadoop-2.7.3.tar to hadoop. Let the path to the hadoop folder be C:\Users\Desktop\A\hadoop

Step 3: Create a new Notepad text file. Save this empty Notepad file as winutils.exe (with Save as type: All Files). Copy this 0 KB winutils.exe file to your bin folder in spark: C:\Users\Desktop\A\spark\bin

Step 4: Now, we have to add these folders to the System environment.

4a: Create a system variable (not a user variable, as a user variable will inherit all the properties of the system variable)
Variable name: SPARK_HOME
Variable value: C:\Users\Desktop\A\spark

Find the Path system variable and click edit. You will see multiple paths. Do not delete any of the paths. Add this variable value: ;C:\Users\Desktop\A\spark\bin

4b: Create a system variable

Variable name: HADOOP_HOME
Variable value: C:\Users\Desktop\A\hadoop

Find the Path system variable and click edit. Add this variable value: ;C:\Users\Desktop\A\hadoop\bin

4c: Create a system variable
Variable name: JAVA_HOME
Search for Java in Windows. Right-click and click "Open file location". You will have to right-click again on any one of the Java files and click "Open file location". You will be using the path of this folder. Alternatively, you can search for C:\Program Files\Java. The Java version installed on my system is jre1.8.0_131.
Variable value: C:\Program Files\Java\jre1.8.0_131\bin

Find the Path system variable and click edit. Add this variable value: ;C:\Program Files\Java\jre1.8.0_131\bin

Step 5: Open a command prompt and go to your spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.

C:\Users\Desktop\A\spark\bin>spark-shell

It may take some time and give some warnings. Finally, it will show "Welcome to Spark version 2.2.0".

Step 6: Type exit() or restart the command prompt and go to the spark bin folder again. Type pyspark:

C:\Users\Desktop\A\spark\bin>pyspark

It will show some warnings and errors, but ignore them. It works.

Step 7: Your Spark setup is complete. If you want to run Spark directly from the Python shell, go to Scripts in your Python folder and type

pip install findspark

in command prompt.

In the Python shell:

import findspark
findspark.init()

Import the necessary modules:

from pyspark import SparkContext
from pyspark import SparkConf

If you would like to skip the steps for importing findspark and initializing it, please follow the procedure given in importing pyspark in python shell.

Aakash Saxena
0

Here is a simple minimal script to run from any Python console. It assumes that you have extracted the Spark libraries you downloaded into C:\Apache\spark-1.6.1.

This works in Windows without building anything and solves problems where Spark would complain about recursive pickling.

import sys
import os

spark_home = r'C:\Apache\spark-1.6.1'

# Make the bundled PySpark and Py4J libraries importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'pyspark.zip'))
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.9-src.zip'))

import pyspark

# Start a spark context:
sc = pyspark.SparkContext()

# Quick sanity check: find the lines in README.md that mention Python
lines = sc.textFile(os.path.join(spark_home, "README.md"))
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()
HansHarhoff
0

Cloudera and Hortonworks are the best tools to get started with HDFS on Microsoft Windows. You can also use VMware or VirtualBox to spin up a virtual machine and build out your HDFS stack with Spark, Hive, HBase, Pig, and Hadoop, using Scala, R, Java, or Python.

Divya Arya