
I am trying to use the elasticsearch package in a Dataproc Serverless PySpark job. I am facing an issue only with this package on Dataproc Serverless.

import os
print("Current dir:", os.getcwd())
print("Current dir list:", os.listdir('.'))

import pandas
import statsmodels
import platform
import numpy

print("Python version:", platform.python_version())
print("Pandas version:", pandas.__version__)
print("statsmodels version:", statsmodels.__version__)
print("Numpy version:", numpy.__version__)

import elasticsearch
from elasticsearch import Elasticsearch as es
# note: __version__ lives on the elasticsearch module, not on the Elasticsearch class
print("elasticsearch version:", elasticsearch.__version__)

Below is the output of this code:

Current dir: /tmp/srvls-batch-7554fe27-4044-4341-ae79-ffe9488ea385
Current dir list: ['pyspark_venv.tar.gz', '.test_sls.py.crc', 'test_sls.py']
Python version: 3.9.15
Pandas version: 1.4.4
statsmodels version: 0.13.5
Numpy version: 1.21.6
Traceback (most recent call last):
  File "/tmp/srvls-batch-7554fe27-4044-4341-ae79-ffe9488ea385/test_sls.py", line 16, in <module>
    from elasticsearch import Elasticsearch as es
ModuleNotFoundError: No module named 'elasticsearch'

I followed the steps at the link below to set up the venv for this job:

https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv

and used the --archives option when submitting the job. Can anyone please correct me if I am missing anything? Thanks in advance.
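For reference, the venv was built roughly like this (a minimal sketch of the steps from that guide; the exact package list is illustrative):

# create and activate a venv, install the needed packages plus venv-pack
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install elasticsearch pandas numpy statsmodels venv-pack

# pack the venv into the archive passed to --archives
venv-pack -o pyspark_venv.tar.gz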

  • Could you share the full `gcloud` command that you used to submit your job (excluding sensitive information, if any)? – Igor Dvorzhak Apr 25 '23 at 03:05
  • Hi @IgorDvorzhak, thanks for responding. `gcloud dataproc batches submit --project --region us-central1 pyspark --archives --container-image --version 1.0 --subnet --service-account --properties spark.driver.cores=4,spark.dynamicAllocation.executorAllocationRatio=0.3,spark.executor.cores=4` (sensitive flag values removed) – ash_ketchum12 Apr 25 '23 at 04:18

1 Answer


When providing a custom Python env (via --archives or a container image), you need to configure Spark to use it instead of the default one.

To do this, you need to set the PYSPARK_PYTHON env var to point to the Python binary in the custom Python env.

It can be done in the container image script, or via the following Spark properties (see the example submission after the list):

  • spark.dataproc.driverEnv.PYSPARK_PYTHON=...
  • spark.executorEnv.PYSPARK_PYTHON=...
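For example, a submission might look like this (a sketch only; the script and archive names are taken from the question, and the `#environment` suffix unpacks the archive into a directory named `environment` in the job's working directory):

gcloud dataproc batches submit pyspark test_sls.py \
    --region=us-central1 \
    --archives=pyspark_venv.tar.gz#environment \
    --properties=spark.dataproc.driverEnv.PYSPARK_PYTHON=./environment/bin/python,spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python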
Igor Dvorzhak
  • Getting a file-not-found error, even though I verified the path and the file exists: PYSPARK_PYTHON=./environment/bin/python JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64 SPARK_EXTRA_CLASSPATH=/opt/spark/jars/* :: loading settings :: file = /etc/spark/conf/ivysettings.xml Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128) at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071) – ash_ketchum12 Apr 26 '23 at 09:34
  • Is this error expected? – ash_ketchum12 Apr 27 '23 at 04:28
  • How do you know that `./environment/bin/python` is the correct path to the Python binary in the new Python env? – Igor Dvorzhak Apr 28 '23 at 04:39
  • If I do not define the parameters spark.dataproc.driverEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON, then I can see the 'environment' directory being copied into the working directory, so the path I mentioned does exist. Were you able to trigger a batch job with a similar requirement? – ash_ketchum12 May 03 '23 at 11:28
  • Seems like the issue is that `./environment/bin/python` is a symlink to `/usr/bin/python`, which does not exist inside the container. I think that adding an `apt install python-is-python3` command during container creation may solve this issue. – Igor Dvorzhak May 03 '23 at 16:48
  • Hi Igor, I tried the same and just wanted to show that we are using an image where the Python binary is installed at /usr/bin/python, but I am facing the same issue. Unfortunately no luck :( spark@38af63f15849:/$ ls -lrt /usr/bin/python* -rwxr-xr-x 1 root root 5479736 Feb 28 2021 /usr/bin/python3.9 lrwxrwxrwx 1 root root 7 Mar 2 2021 /usr/bin/python -> python3 lrwxrwxrwx 1 root root 9 Apr 5 2021 /usr/bin/python3 -> python3.9 – ash_ketchum12 May 04 '23 at 11:32
  • I think that you need to make sure that in the local environment you used to create the venv, the Python binary is also available at the `/usr/bin/python` path (i.e. both your local env and the container must use the same Python binary path). Could you check this with `which python` in your local env? – Igor Dvorzhak May 05 '23 at 22:02
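For reference, a quick way to run that check (a sketch, assuming the venv was built with the standard venv module, whose bin/python is a symlink to the base interpreter):

# in the local env used to build the venv: find the interpreter path the venv expects
which python
readlink -f pyspark_venv/bin/python

# inside the container: confirm the same path exists
ls -l /usr/bin/python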