
I want to use the matplotlib.bblpath or shapely.geometry libraries in PySpark.

When I try to import either of them, I get the following error:

>>> from shapely.geometry import polygon
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry

I know the module isn't present, but how can these packages be brought to my pyspark libraries?


4 Answers


In the Spark context try using:

SparkContext.addPyFile("module.py")  # Also supports .zip

Quoting from the docs:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
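
For example, a minimal sketch of what this can look like (the zip and module names are placeholders; note that addPyFile only ships Python code, so a package with native extensions such as Shapely still needs its C library, GEOS, installed on the workers, as covered in the answer below):

# Build the archive beforehand, e.g. `zip -r mymodule.zip mymodule/`.
# "mymodule" is a placeholder package used only for illustration.
sc.addPyFile("mymodule.zip")

def process(x):
    # Imports inside the function resolve on the executors,
    # where the shipped archive has been added to the Python path.
    import mymodule
    return mymodule.transform(x)   # hypothetical helper inside the placeholder package

sc.parallelize([1, 2, 3]).map(process).collect()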

  • I am able to add this dependency. Is there a way to do this when I am doing a spark-submit? I am doing a spark-submit of file.py; in that file, should I be calling addPyFile("module.py"), or is there a way to add dependencies by adding an argument to the spark-submit command? – thenakulchawla May 13 '16 at 06:58
  • From the Spark docs (https://spark.apache.org/docs/1.1.0/submitting-applications.html) it seems feasible to add a .py file by argument (putting it in the search path). However, I do not know if the submission API for PySpark is different in any way. – armatita May 13 '16 at 07:27
  • OK, I will try it as an argument and in my file, both ways, to see what works. – thenakulchawla May 13 '16 at 08:26
  • Has anyone successfully uploaded a .zip file this way? It's not working for me when uploading packages, even those without dependencies. – pir Jun 06 '17 at 20:29
  • @pir did you figure it out? – Ivan Bilan Mar 06 '18 at 11:07
  • @ivan_bilan Way late, but... I had a similar problem and got addPyFile() to work for me. Please see the full post here: https://stackoverflow.com/q/51450462/8236733. The question may help you as an example, and the answer may at least be a useful debugging step. – lampShadesDrifter Jul 24 '18 at 23:49
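
To address the spark-submit question in the comments: the equivalent at submit time is the --py-files flag, which places the listed .py/.zip/.egg files on the executors' Python path (a sketch; the file names are placeholders):

spark-submit --py-files mymodule.zip file.py

Inside file.py, the import of the shipped module then resolves on the executors as well.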

This is how I got it working in our AWS EMR cluster (it should be the same in any other cluster as well). I created the following shell script and executed it as a bootstrap action:

#!/bin/bash
# Shapely installation: build the GEOS C library from source first,
# since Shapely needs it at import time.
wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2
tar jxf geos-3.5.0.tar.bz2
cd geos-3.5.0 && ./configure --prefix=$HOME/geos-bin && make && make install
# Make the freshly built GEOS shared libraries visible to the dynamic linker.
sudo cp /home/hadoop/geos-bin/lib/* /usr/lib
sudo /bin/sh -c 'echo "/usr/lib" >> /etc/ld.so.conf'
sudo /bin/sh -c 'echo "/usr/lib/local" >> /etc/ld.so.conf'
sudo /sbin/ldconfig
sudo /bin/sh -c 'echo -e "\nexport LD_LIBRARY_PATH=/usr/lib" >> /home/hadoop/.bashrc'
source /home/hadoop/.bashrc
# Install the Python packages themselves.
sudo pip install shapely
echo "Shapely installation complete"
pip install https://pypi.python.org/packages/74/84/fa80c5e92854c7456b591f6e797c5be18315994afd3ef16a58694e1b5eb1/Geohash-1.0.tar.gz
exit 0

Note: Instead of running it as a bootstrap action, this script can be executed independently on every node in the cluster. I have tested both scenarios.
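
For completeness, one way to attach such a script when the cluster is created is through the EMR API. The following is only a hedged sketch using boto3; the bucket, script name, region, instance types, and release label are placeholders to adapt to your setup:

import boto3

# Placeholder values throughout; adjust to your own account and cluster layout.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="geo-cluster",
    ReleaseLabel="emr-4.7.2",          # pick a release that ships the Spark version you need
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "Install GEOS and Shapely",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/install-shapely.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])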

The following is a sample PySpark and Shapely snippet (a Spark SQL UDF) to verify that the above commands work as expected:

Python 2.7.10 (default, Dec  8 2015, 18:25:23) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Dec  8 2015 18:25:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> from shapely.wkt import loads as load_wkt
>>> def parse_region(region):
...     from shapely.wkt import loads as load_wkt
...     reverse_coordinate = lambda coord: ' '.join(reversed(coord.split(':')))
...     coordinate_list = map(reverse_coordinate, region.split(', '))
...     if coordinate_list[0] != coordinate_list[-1]:
...         coordinate_list.append(coordinate_list[0])
...     return str(load_wkt('POLYGON ((%s))' % ','.join(coordinate_list)).wkt)
... 
>>> udf_parse_region=udf(parse_region, StringType())
16/09/06 22:18:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/06 22:18:34 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> df = sqlContext.sql('select id, bounds from <schema.table_name> limit 10')
>>> df2 = df.withColumn('bounds1', udf_parse_region('bounds'))
>>> df2.first()
Row(id=u'0089d43a-1b42-4fba-80d6-dda2552ee08e', bounds=u'33.42838509594465:-119.0533447265625, 33.39170168789402:-119.0203857421875, 33.29992542601392:-119.0478515625', bounds1=u'POLYGON ((-119.0533447265625 33.42838509594465, -119.0203857421875 33.39170168789402, -119.0478515625 33.29992542601392, -119.0533447265625 33.42838509594465))')
>>> 

Thanks, Hussain Bohra


I found a great solution in the AWS docs using SparkContext. I was able to add pandas and other packages with it:

Using SparkContext to add packages to notebook with PySpark Kernel in EMR

sc.install_pypi_package("pandas==0.25.1")
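
As a follow-up, the same AWS docs describe companion notebook calls for listing and removing packages; a short sketch (availability depends on the EMR release, so treat these as assumptions to verify):

sc.list_packages()                    # list the packages currently installed for the notebook session
sc.install_pypi_package("shapely")    # a version pin such as "shapely==1.7.1" also works
sc.uninstall_package("shapely")       # remove the package again when it is no longer needed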


Is this on standalone (i.e. laptop/desktop) or in a cluster environment (e.g. AWS EMR)?

  1. If on your laptop/desktop, pip install shapely should work just fine. You may need to check your environment variables for your default python environment(s). For example, if you typically use Python 3 but use Python 2 for pyspark, then you would not have shapely available for pyspark.

  2. If in a cluster environment such as AWS EMR, you can try the following (see also the verification sketch after the code):

    import os

    def myfun(x):
        # Install Shapely on whichever worker node runs this task.
        os.system("pip install shapely")
        return x

    rdd = sc.parallelize([1, 2, 3, 4])  # assuming 4 worker nodes
    rdd.map(lambda x: myfun(x)).collect()
    # Each task runs the install so the library can be imported on every worker.
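
A quick way to confirm that the install actually took effect on the executors is to probe the import from inside a task. This is only a sketch; it simply reports the Shapely version each executor can see:

    # Each task imports shapely on whichever executor runs it and returns the version it found.
    sc.parallelize(range(4), 4).map(lambda _: __import__("shapely").__version__).collect()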
    

"I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."

On EMR, if you want pyspark to be pre-prepared with whatever other libraries and configurations you want, you can use a bootstrap step to make those adjustments. Aside from that, you can't "add" a library to pyspark without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).

  • The problem with this is that it fails to install the package on node 3 if it was in use. – user48956 Jun 16 '17 at 18:30
  • You can use a bash script at the start up of your EMR (hopefully you're using EMR if on AWS) to install all your needed libraries. This is the "bootstrap install step" – Jon Jun 16 '17 at 18:32
  • @user48956 you mustn't import any 3rd-party packages that might get updated before you update everything you need. – ivan_pozdeev Feb 04 '18 at 23:53