37

I am trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quick start examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depended on Spark?
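
For concreteness, the manual workaround I have in mind is something like this (the path and script name are just placeholders):

export PYTHONPATH="/path/to/spark-1.0.2/python:$PYTHONPATH"
python my_standalone_script.py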

W.P. McNeill
  • 16,336
  • 12
  • 75
  • 111
  • Does the pyspark executable run? Then from within there, you can query where the pyspark package lives, and ensure that the appropriate path is included in your PYTHONPATH for standalone modules. – mdurant Aug 08 '14 at 14:14
  • Good point. I amended the question. – W.P. McNeill Aug 08 '14 at 14:18
  • I think that, since installing the whole spark ecosystem is so involved, I'd make do with setting the PYTHONPATH. In any case, you will be executing the scripts using spark-submit - do you have problems with that? – mdurant Aug 08 '14 at 14:34
  • Oh, I see. So I don't write standalone Spark Python scripts. I write Python scripts with pyspark dependencies that are then submitted to a Spark cluster. I didn't get that from the quick start writeup, but I guess it makes sense. Hadoop works the same way. If that's correct, you should submit it as an answer, @mdurant. Thanks. – W.P. McNeill Aug 08 '14 at 14:39
  • Please try it first :) – mdurant Aug 08 '14 at 14:55
  • I successfully ran the sample Python app from the getting started guide using spark-submit. Write this up as an answer and collect your prize! – W.P. McNeill Aug 08 '14 at 16:55
  • I have a very similar problem: I can run ./bin/pyspark but I don't see where the module is installed. How can I find out the `HOME DIRECTORY` of spark? – user3768495 Jun 22 '16 at 05:13
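
(As the first comment suggests, you can ask Python itself where the package lives from inside the bin/pyspark shell; the path shown below is only an example.)

>>> import pyspark
>>> pyspark.__file__
'/opt/spark-1.0.2/python/pyspark/__init__.py'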

5 Answers

36

From Spark 2.2.0 onwards, use pip install pyspark to install PySpark on your machine.

For older versions, follow the steps below. Add the PySpark library to the Python path in your .bashrc:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

Also, don't forget to set SPARK_HOME. PySpark depends on the py4j Python package, so install it as follows:

pip install py4j

For more details about standalone PySpark applications, refer to this post.
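
Once the path is set, a standalone script can import pyspark directly. A minimal sketch (the file name and the toy computation are only for illustration):

# simple_app.py -- run with spark-submit simple_app.py,
# or with plain python once PYTHONPATH and SPARK_HOME are set
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("SimpleApp").setMaster("local[2]")
sc = SparkContext(conf=conf)

# count the even numbers in 0..99 on the local "cluster"
evens = sc.parallelize(list(range(100))).filter(lambda x: x % 2 == 0).count()
print("even count: %d" % evens)

sc.stop()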

prabeesh
  • 935
  • 9
  • 11
  • 11
    Your reply is OK, but it would be useful to add that you need the full Spark download on your machine. You might think it is obvious, but for a beginner (like me) it isn't. – ssoto Jul 13 '15 at 07:16
  • Please refer to this issue: [SPARK-1267](https://issues.apache.org/jira/browse/SPARK-1267) – prabeesh Jul 14 '15 at 09:51
  • 3
    @ssoto Spark-2.2.0 onwards you can use `pip install pyspark`. – prabeesh Jul 15 '17 at 08:25
  • 3
    pip automatically installs `py4j` as a dependency of `pyspark`. – orluke Aug 25 '17 at 15:37
  • As a prerequisite, make sure to install Java 8 first (as described e.g. http://www.webupd8.org/2014/03/how-to-install-oracle-java-8-in-debian.html or http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html ) – asmaier Sep 06 '17 at 13:32
15

I installed pyspark for standalone use by following a guide. The steps are:

export SPARK_HOME="/opt/spark"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

Then you need to install py4j:

pip install py4j

To try it:

./bin/spark-submit --master local[8] <python_file.py>
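
Here <python_file.py> is just an ordinary Python script that imports pyspark. A rough sketch (the input path is made up):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("/tmp/input.txt")            # made-up input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()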
ssoto
  • 469
  • 5
  • 19
11

As of Spark 2.2, PySpark is now available in PyPI. Thanks @Evan_Zamir.

pip install pyspark


As of Spark 2.1, you just need to download Spark and run setup.py:

cd my-spark-2.1-directory/python/
python setup.py install  # or pip install -e .

There is also a ticket for adding it to PyPI.
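
Either way, a quick sanity check (just a sketch; pyspark in these releases exposes __version__):

python -c "import pyspark; print(pyspark.__version__)"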

Kamil Sindi
  • 21,782
  • 19
  • 96
  • 120
8

You can set the PYTHONPATH manually as you suggest, and this may be useful to you when testing stand-alone non-interactive scripts on a local installation.

However, (py)spark is all about distributing your jobs to nodes on clusters. Each cluster has a configuration defining a manager and many parameters; the details of setting this up are here, and include a simple local cluster (this may be useful for testing functionality).

In production, you will be submitting tasks to spark via spark-submit, which will distribute your code to the cluster nodes and establish the context in which they run on those nodes. You do, however, need to make sure that the Python installations on the nodes have all the required dependencies (the recommended way) or that the dependencies are passed along with your code (I don't know how that works).
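
For what it's worth, one common mechanism for shipping dependencies along with your code is spark-submit's --py-files option, which places .py, .zip, or .egg files on the PYTHONPATH of the nodes; the master URL and file names below are placeholders:

spark-submit --master spark://master-host:7077 --py-files deps.zip my_job.py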

mdurant
  • 27,272
  • 5
  • 45
  • 74
0

Don't write export $SPARK_HOME; write export SPARK_HOME (no dollar sign when setting the variable).
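
In other words (the path here is just an example):

# wrong: the leading $ makes the shell expand the (probably empty) variable first
export $SPARK_HOME=/opt/spark

# right
export SPARK_HOME=/opt/spark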

waku
  • 111
  • 1
  • 4