
I have installed pyspark in a miniconda environment on Ubuntu through `conda install pyspark`. So far everything works fine: I can run jobs through `spark-submit` and I can inspect running jobs at localhost:4040. But I can't locate `start-history-server.sh`, which I need to look at jobs that have completed.

It is supposed to be in {spark}/sbin, where {spark} is the installation directory of Spark. I'm not sure where that is supposed to be when Spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate `start-history-server.sh`. For what it's worth, this is the case for both Python 3.7 and 2.7 environments.

My question is: is `start-history-server.sh` included in a conda installation of pyspark? If yes, where? If no, what's the recommended alternative way of evaluating Spark jobs after the fact?

oulenz
  • `pyspark` is a Python API for Spark. It won't install all the Spark tools. If you need all the Spark tools you can install them following https://medium.com/devilsadvocatediwakar/installing-apache-spark-on-ubuntu-8796bfdd0861 – pedvaljim Jan 29 '19 at 11:36
  • @pedvaljim That link doesn't provide instructions for installing just any Spark tools, but simply for installing the Scala version of Spark. If it's true that the history server is absent from pyspark, then somehow that feels like an oversight rather than a conscious design decision. How else do people gain an understanding of whether their pyspark code is written well, given that Spark is essentially a black box? – oulenz Jan 29 '19 at 12:50
  • Actually, all you need to have preinstalled is Java; Scala will be installed by Spark: https://spark.apache.org/releases/spark-release-2-4-0.html – pedvaljim Jan 29 '19 at 13:06
  • Sure, but pyspark already installs Spark. In that sense it is not 'just a Python API'. So having to have a complete second Spark installation just to be able to use the history server sounds wrong. – oulenz Jan 29 '19 at 13:22

1 Answer


EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.


As @pedvaljim points out in a comment, this is not conda-specific: the sbin directory isn't included in pyspark at all.

The good news is that you can simply download this directory manually from GitHub into your spark folder (I'm not sure how to download just one directory, so I cloned all of Spark). If you're using mini- or anaconda, the spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
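
For concreteness, here is a rough sketch of those steps on the command line; the release tag v2.4.0, the variable name PYSPARK_DIR, and the clone location are only illustrative, so adjust them to match your setup:

```bash
# Locate the pyspark installation inside the currently active conda environment
# (this is the "spark folder" mentioned above)
PYSPARK_DIR=$(python -c "import os, pyspark; print(os.path.dirname(pyspark.__file__))")

# Shallow-clone Spark at a release tag roughly matching your installed pyspark version
git clone --depth 1 --branch v2.4.0 https://github.com/apache/spark.git

# Copy the sbin scripts into the pyspark installation
cp -r spark/sbin "$PYSPARK_DIR/"

# The history server script should now be available
"$PYSPARK_DIR/sbin/start-history-server.sh"
```

Note that the history server only shows jobs that wrote event logs, so you will likely also need to run your jobs with spark.eventLog.enabled set to true, and point the history server's spark.history.fs.logDirectory at the same location as spark.eventLog.dir.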

oulenz