I am trying to import the pydeequ library in an AWS environment while building a job with Glue. I put a zip file of pydeequ in the Python library path and the JAR files in the Dependent JARs path. My script is the following:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pydeequ
from pydeequ.analyzers import *
from pyspark.sql import SparkSession  # required by SparkSession.builder below
import findspark
findspark.init()
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark = (SparkSession\
.builder\
.config("spark.jars.packages", pydeequ.deequ_maven_coord)\
.config("spark.jars.excludes", pydeequ.f2j_maven_coord)\
.getOrCreate())
sc = SparkContext.getOrCreate()  # reuse the running context instead of creating a second one
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
However, importing pydeequ fails with the following error:
2022-12-21 17:50:31,717 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
File "/tmp/Test_Pydeequ.py", line 7, in <module>
import pydeequ
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "/tmp/pydeequ.zip/pydeequ/__init__.py", line 21, in <module>
from pydeequ.configs import DEEQU_MAVEN_COORD
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "/tmp/pydeequ.zip/pydeequ/configs.py", line 37, in <module>
DEEQU_MAVEN_COORD = _get_deequ_maven_config()
File "/tmp/pydeequ.zip/pydeequ/configs.py", line 28, in _get_deequ_maven_config
spark_version = _get_spark_version()
File "/tmp/pydeequ.zip/pydeequ/configs.py", line 23, in _get_spark_version
spark_version = output.stdout.decode().split("\n")[-2]
IndexError: list index out of range
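The last frame of the traceback can be reproduced in isolation. This is only a minimal sketch: the function name and the sample output are hypothetical, and I am assuming (not confirming) that whatever command pydeequ runs to detect the Spark version returned empty stdout in Glue. The indexing expression itself is copied from line 23 of configs.py as shown in the traceback.

```python
def second_to_last_line(stdout: bytes) -> str:
    # Same expression as configs.py line 23 in the traceback:
    # take the last line that precedes the trailing newline.
    return stdout.decode().split("\n")[-2]


# Hypothetical output ending in a version line followed by a newline.
sample_stdout = b"some banner text\n3.1.1\n"
print(second_to_last_line(sample_stdout))

# Empty output splits into a one-element list, so index -2 does not exist.
try:
    second_to_last_line(b"")
except IndexError as exc:
    print("IndexError:", exc)
```

If that assumption holds, the import fails before any of my own Spark code runs, because the version lookup happens at import time inside pydeequ/configs.py.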
I need to use pydeequ inside the AWS environment, and I don't understand why this error occurs.
I would really appreciate any help solving it.