
I am trying to deploy PySpark locally using the instructions at

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi

I can see that extra dependencies are available, such as sql and pandas_on_spark, which can be installed with

pip install pyspark[sql,pandas_on_spark]

But how can we find all available extras?

Looking in the JSON metadata of the pyspark package (based on https://wiki.python.org/moin/PyPIJSON)

https://pypi.org/pypi/pyspark/json

I could not find the possible extra dependencies (as described in "What is 'extra' in pypi dependency?"); the value of requires_dist is null.
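
For reference, this is roughly the check described above, as a minimal sketch using only the standard library (the field names follow the PyPI JSON API):

    import json
    import urllib.request

    # Fetch the PyPI JSON metadata for pyspark and inspect requires_dist,
    # where extras would normally show up via markers like: extra == "sql"
    with urllib.request.urlopen("https://pypi.org/pypi/pyspark/json") as response:
        data = json.load(response)

    print(data["info"]["requires_dist"])  # None for pyspark at the time of writing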

Many thanks for your help.

karpan
  • As far as I know, you can not easily. If it is not documented, then you will have to look at the code/config for the packaging. In this case, here: https://github.com/apache/spark/blob/eb30a27e53158e64fffaa6d32ff9369ffbae0384/python/setup.py#L262-L274 -- `ml`, `mllib`, `sql`, `pandas_on_spark`. – sinoroc Mar 27 '22 at 11:34
  • If you already installed pyspark, you can use a workaround described [here](https://stackoverflow.com/a/63603540/6942134) to list its extras. – SergiyKolesnikov Aug 07 '23 at 16:32
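
To illustrate the workaround from the last comment, here is a minimal sketch using importlib.metadata (Python 3.8+), assuming pyspark is already installed in the current environment:

    from importlib.metadata import metadata

    # The installed package's metadata declares one Provides-Extra entry per extra;
    # this only works for packages installed in the current environment.
    pyspark_metadata = metadata("pyspark")
    print(pyspark_metadata.get_all("Provides-Extra"))  # e.g. ['ml', 'mllib', 'sql', 'pandas_on_spark']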

1 Answer


As far as I know, you cannot easily get the list of extras. If the list is not clearly documented, then you will have to look at the code/config for the packaging. In this case, that is the setup.py in the Spark repository (https://github.com/apache/spark/blob/eb30a27e53158e64fffaa6d32ff9369ffbae0384/python/setup.py#L262-L274), which gives the following list: ml, mllib, sql, and pandas_on_spark.
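
For context, extras are declared as the keys of the extras_require argument to setup(). The snippet below is a simplified, hypothetical illustration of that pattern, not the actual pyspark setup.py; the package name and dependencies are placeholders:

    from setuptools import setup

    setup(
        name="example-package",
        version="0.1.0",
        # Each key of extras_require is an installable extra, selected with e.g.
        # pip install example-package[sql,pandas_on_spark]
        extras_require={
            "ml": ["numpy"],                    # placeholder dependencies
            "mllib": ["numpy"],
            "sql": ["pandas", "pyarrow"],
            "pandas_on_spark": ["pandas", "pyarrow", "numpy"],
        },
    )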

sinoroc