
I have deployed HDP 2.6.4 on a virtual machine.

I can see that Spark2 is not pointing to the correct Python folder. My questions are:

1) How can I find where my python is located?

Solution: type `whereis python` and you will get a list of the locations where it is installed.
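On the sandbox, the output typically looks something like the following (the exact paths are illustrative and will vary by image):

```
$ whereis python
python: /usr/bin/python /usr/bin/python2.7 /usr/lib/python2.7 /usr/include/python2.7
```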

2) How can I update the existing Python libraries and add new libraries to that folder? For example, the equivalent of `pip install numpy` on the CLI.

  • Nothing clear yet (one possible approach is sketched below).
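As a minimal sketch, assuming pip is installed for the interpreter found in question 1 (the system Python path below is a placeholder):

```bash
# Install a package for the specific interpreter Spark2 uses,
# rather than whichever pip happens to be first on the PATH:
/usr/bin/python -m pip install numpy
```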

3) How can I make Zeppelin's Spark2 interpreter point at the specific directory that contains the Python folder I can update? On Zeppelin, there is a little 'edit' button with which I can change the path to the directory that contains Python.

Solution: go to the interpreter settings in Zeppelin, find spark2, and make `zeppelin.pyspark.python` point to where Python is already installed.
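For instance, after the edit the property could read as follows (the value shown is the system Python reported by `whereis`, used here as an example):

```
zeppelin.pyspark.python = /usr/bin/python
```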

Now, if you need Python 3.4+, there is a whole set of different steps you have to follow to first get Python 3.4+ into the HDP sandbox.
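Those steps are not spelled out in this thread, but as a loose sketch one route is Miniconda, which the comments below also point to; the installer URL and flags here are assumptions based on the standard Miniconda installer and may have changed:

```bash
# Illustrative only: download and install Miniconda (Python 3.x) in
# batch mode under your home directory, then note the interpreter path
# so zeppelin.pyspark.python can point at it.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/python --version
```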

Thank you,

1 Answer


For a Sandbox environment like yours, the sandbox image is built on a Linux OS (CentOS). The Zeppelin notebook, in all probability, points to the Python installation that ships with every Linux OS. If you wish to have your own installation of Python and your own set of libraries for data analysis, like those in the SciPy stack, you need to install Anaconda on your virtual machine. Your VM needs to be connected to the internet so that you can download and install the Anaconda package.

You can then point Zeppelin to Anaconda's interpreter at the following path: /home/user/anaconda3/bin/python, where user is your username.
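Before editing the interpreter setting, it is worth confirming that the binary actually exists. A quick check (the path assumes a default Anaconda install, with user replaced by your own username):

```bash
# Sanity-check the Anaconda interpreter Zeppelin will point to:
ls -l /home/user/anaconda3/bin/python
/home/user/anaconda3/bin/python --version
```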

Zeppelin's configuration also confirms that it uses the default Python installation at /usr/bin/python. You can go through its documentation for more information.

UPDATE

Hi Joseph, Spark installations, by default, use the Python interpreter and the Python libraries that have been installed on your OS. The folder structure that you have shown only tells you the location of the PySpark module. This module is a library like Pandas or NumPy.

What you can do is install the SciPy stack (NumPy, Pandas, Matplotlib, etc.) via the command `pip install <package name>` and import those libraries directly into your Zeppelin notebook.
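For example, assuming pip is set up for the system Python at /usr/bin/python (you may need to run this as root on the sandbox):

```bash
# Install the SciPy stack for the interpreter Zeppelin will use:
/usr/bin/python -m pip install numpy pandas matplotlib
# Confirm the packages import from that same interpreter:
/usr/bin/python -c "import numpy, pandas; print(numpy.__version__, pandas.__version__)"
```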

Use the command `whereis python` in the terminal of your sandbox; the result will give you something like the following: /usr/bin/python /usr/bin/python2.7 ....

In your Zeppelin configuration, for the property `zeppelin.pyspark.python`, you can set the first value from the output of the previous command, i.e. /usr/bin/python. Now all the libraries you installed via the `pip install` command will be available to you in Zeppelin.

This process would only work for your Sandbox environment. In a real production cluster, your administrator needs to install all these libraries on all the nodes of your Spark cluster.

Yayati Sule
  • I do not wish to have my own installation of Python. I want to access and update the Python that Spark2 already uses, and point PySpark to it. I would like to use the pyspark folder and update it so it runs within the spark2 interpreter. Kindly check the update :) – May 30 '18 at 08:36
  • Thank you, you have cleared up a lot. I got some errors that I will try to fix. By the way, what is the latest Python version I can use with Spark2 in Zeppelin on HDP 2.6.4? – May 30 '18 at 10:12
  • Regarding my previous comment: `sc.version` returns `res0: String = 2.2.0.2.6.4.0-91` – May 30 '18 at 10:14
  • The language versions supported by Apache Spark 2.2.0, as per its website (http://spark.apache.org/docs/2.2.0/), are as follows: Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.2.0 uses Scala 2.11; you will need to use a compatible Scala version (2.11.x). – Yayati Sule May 30 '18 at 10:17
  • I asked another question specific to installing libraries: https://stackoverflow.com/questions/50603891/how-to-install-libraries-to-python-in-zeppelin-spark2-in-hdp – May 30 '18 at 11:48
  • Can you please provide the link for installing Miniconda in HDP for Zeppelin use? The Python version must be 3.4.x. – May 30 '18 at 12:50
  • I answered the question you asked at the link you provided. You do not need Miniconda or Anaconda if you install the packages yourself. Miniconda can be found at this link: https://conda.io/miniconda.html. To use a Miniconda installation with PySpark, read this answer of mine: https://stackoverflow.com/a/45460559/5742662. – Yayati Sule May 30 '18 at 12:58