
Currently I'm working on a Python 3.6 project with some other people. We use a requirements.txt file to store our dependencies, which are installed with pip or conda.

I added pyspark >= 2.2.0, which causes pip install pyspark to be run. We use Anaconda. The installation completes without errors and I can find the pyspark directory in my local Anaconda environment's site-packages directory.
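For reference, the relevant entry in our requirements.txt is just the single version specifier described above (shown here purely as an illustration of the pin):

pyspark>=2.2.0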

When I run my Python script, which contains some Spark code, I get the error: Failed to find Spark jars directory. After some research I found out that I need to build the pyspark code because it isn't prebuilt when it comes with pip.

I read the documentation but it isn't clear to me how to build the code. Why is there no build directory in my pyspark installation directory (needed to build it with build/mvn)? I prefer to use requirements.txt because I don't want all developers to have to download & install pyspark on their own.

EDIT - The main problem is that running pyspark commands in the shell gives the following error:

Failed to find Spark jars directory.

You need to build Spark before running this program.

Mike Evers
  • as @user8371915 said, also https://stackoverflow.com/questions/25205264/how-do-i-install-pyspark-for-use-in-standalone-scripts contains useful information about it. – F. Leone Dec 12 '17 at 18:05
  • The problem is not with running the command 'pip install pyspark'. But with building the code. The installation doesn't contain a build directory. – Mike Evers Dec 12 '17 at 18:08

1 Answer


I've only recently used pip install pyspark, and was able to use Spark immediately (without building).

If you activate the environment and simply run pyspark you should see PySpark working, which indicates that the jars are built.
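As a quick check (a minimal sketch, not part of the original steps; it assumes nothing beyond the pip-installed pyspark package), you can run a trivial local job from the activated environment. If the bundled jars were missing, this would typically fail with the same "Failed to find Spark jars directory" message:

python -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.master('local[1]').appName('smoke-test').getOrCreate(); print(spark.range(100).count()); spark.stop()"

This should print 100.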

I've checked my environment and the components are located as follows:

  • The shell scripts (spark-shell, etc) will be placed in the bin directory within your conda environment, e.g. ~/.conda/envs/my_env/bin.
  • The binaries themselves are inside the jars folder of the pyspark directory, i.e. ~/.conda/envs/my_env/lib/python3.6/site-packages/pyspark/jars
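To confirm the jars are actually present (path taken from the location above; adjust the environment name and Python version to your own setup), you can simply list that directory:

ls ~/.conda/envs/my_env/lib/python3.6/site-packages/pyspark/jars

You should see a long list of .jar files (spark-core, spark-sql, and so on).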

To use pyspark within a conda environment you just need to create an environment, activate it, and install. This is as simple as running these four commands.

conda create -n my_env python=3.6
source activate my_env
pip install pyspark
pyspark

If you have pyspark inside your requirements.txt file, you can replace the pip install pyspark line above with pip install -r requirements.txt.
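For example (a minimal sketch; the environment name and Python version are illustrative), with pyspark>=2.2.0 listed in requirements.txt the whole flow becomes:

conda create -n my_env python=3.6
source activate my_env
pip install -r requirements.txt
pyspark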

Kirk Broadhurst
  • My shell scripts are located in this folder C:\Users\username\Anaconda3\Lib\site-packages\pyspark\bin and the jars are located here C:\Users\username\Anaconda3\Lib\site-packages\pyspark\jars but the shell scripts fail with the following error: **the system cannot find the path specified**. – Mike Evers Dec 12 '17 at 18:22
  • That path looks wrong. Did you create a virtual environment, and install with that environment? You should have an environment specific `PYTHONPATH` and all subsequent components are self-contained within that environment. – Kirk Broadhurst Dec 12 '17 at 19:21
  • Do you have a guide? I want it to work with requirements.txt – Mike Evers Dec 12 '17 at 20:34
  • @MikeEvers just google 'how to use conda environments'. e.g. https://conda.io/docs/user-guide/getting-started.html#managing-envs – Kirk Broadhurst Dec 12 '17 at 20:45
  • Keep getting this when running pyspark command: Failed to find Spark jars directory. You need to build Spark before running this program. – Mike Evers Dec 12 '17 at 21:10
  • I appreciate all your effort but the main problem still persists. When running simple commands like pyspark and spark-submit I get this error: **Failed to find Spark jars directory. You need to build Spark before running this program.** – Mike Evers Dec 12 '17 at 21:44
  • I also get this error when running the scripts in the shell: **The system cannot find the path specified** – Mike Evers Dec 12 '17 at 21:55
  • @KirkBroadhurst Are you sure that, before running `pip install pyspark`, you didn't have Spark already downloaded in your machine? – desertnaut Dec 13 '17 at 13:02
  • Yes , I'm sure. – Mike Evers Dec 13 '17 at 13:03
  • 1
    @MikeEvers indeed - I have just found out this, too – desertnaut Dec 13 '17 at 14:44
  • 2
    @desertnaut sorry, that was a big assumption on my part. Years ago I tried to do Spark development on Windows and it wasn't a good experience; I didn't consider that might be your platform. Strongly recommend using Linux. I assume it will work in Windows (the same binaries would be downloaded) and it will 'simply' be a matter of configuration. Unfortunately Spark configuration in Windows is horrible - again, in my long ago experience. – Kirk Broadhurst Dec 13 '17 at 15:13