I am trying to import the pydeequ library in an AWS environment while building a job with Glue. I put a zip file of pydeequ in the Python library path and the JAR file in the Dependent JARs path. My script is the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pydeequ
from pydeequ.analyzers import *
from pyspark.sql import SparkSession  # needed for SparkSession.builder below
import findspark
findspark.init()

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

spark = (SparkSession\
   .builder\
   .config("spark.jars.packages", pydeequ.deequ_maven_coord)\
   .config("spark.jars.excludes", pydeequ.f2j_maven_coord)\
   .getOrCreate())

sc = SparkContext.getOrCreate()  # reuse the context already created by the SparkSession above
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

But I couldn't import the pydeequ library, and I get the following error:

2022-12-21 17:50:31,717 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
  File "/tmp/Test_Pydeequ.py", line 7, in <module>
    import pydeequ
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/__init__.py", line 21, in <module>
    from pydeequ.configs import DEEQU_MAVEN_COORD
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 37, in <module>
    DEEQU_MAVEN_COORD = _get_deequ_maven_config()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 28, in _get_deequ_maven_config
    spark_version = _get_spark_version()
  File "/tmp/pydeequ.zip/pydeequ/configs.py", line 23, in _get_spark_version
    spark_version = output.stdout.decode().split("\n")[-2]
IndexError: list index out of range

I need to work with the pydeequ library inside the AWS environment, and I don't know why I am hitting this problem.

I would very much appreciate it if someone could help me solve it.

1 Answer

I solved the problem by doing two things:

  1. First step

I had to open pydeequ's configs.py file and change the code in the _get_spark_version() function.

    @lru_cache(maxsize=None)
    def _get_spark_version() -> str:
        # Get version from a subprocess so we don't mess up with existing SparkContexts.
        command = [
            "python",
            "-c",
            "from pyspark import SparkContext; print(SparkContext.getOrCreate()._jsc.version())",
        ]
        output = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        #spark_version = output.stdout.decode().split("\n")[-2]
        spark_version = '3.1.1'
        return spark_version

I simply commented out the original spark_version assignment and hard-coded '3.1.1', which is the Spark version used in Glue 3.0.
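The IndexError in the traceback follows from how the original line parses the subprocess output: if the child process prints nothing to stdout, `split("\n")` yields a single-element list and `[-2]` is out of range. As a less invasive variant of the hard-coded value, here is a minimal sketch of a more defensive parse. The helper name `parse_spark_version` and its `fallback` parameter are hypothetical and not part of pydeequ; the fallback of `3.1.1` is the Glue 3.0 value mentioned above.

```python
def parse_spark_version(stdout: bytes, fallback: str = "3.1.1") -> str:
    """Return the last non-empty line of stdout, or `fallback` if there is none.

    Hypothetical helper for illustration only; not part of pydeequ.
    """
    lines = [line.strip() for line in stdout.decode().splitlines() if line.strip()]
    return lines[-1] if lines else fallback


# The original expression fails when the subprocess prints nothing:
try:
    b"".decode().split("\n")[-2]
except IndexError as exc:
    print("original parse fails:", exc)

# The defensive version falls back instead of raising.
print(parse_spark_version(b""))
print(parse_spark_version(b"3.1.1\n"))
```

With a helper like this, the function in configs.py could return `parse_spark_version(output.stdout)` instead of a bare constant, so the detected version is still used wherever the subprocess does produce output.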

  2. Second step

I was also using the wrong version of the deequ JAR. I had version 1.0.3, which is not compatible with Spark 3.1.1 as used by Glue 3.0, so I had to download the JAR for deequ 2.0.0.
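A small pre-flight check can make this kind of mismatch fail early with a readable message rather than deep inside the job. The sketch below is only an illustration under stated assumptions: the `KNOWN_GOOD` table contains just the single pairing reported in this answer (Spark 3.1.x with deequ 2.0.0), and the helper name `check_deequ_version` is hypothetical.

```python
# Known-good pairings: Spark "major.minor" -> deequ version.
# Assumption: only the pairing reported in this answer is listed here;
# other Spark versions would need entries taken from the deequ releases.
KNOWN_GOOD = {"3.1": "2.0.0"}


def check_deequ_version(spark_version: str, deequ_version: str) -> None:
    """Raise ValueError if the deequ version is not the known-good one.

    Hypothetical helper for illustration only.
    """
    major_minor = ".".join(spark_version.split(".")[:2])
    expected = KNOWN_GOOD.get(major_minor)
    if expected is None:
        raise ValueError(f"No known-good deequ version recorded for Spark {spark_version}")
    if deequ_version != expected:
        raise ValueError(
            f"deequ {deequ_version} is not the known-good version for "
            f"Spark {spark_version}; expected {expected}"
        )


check_deequ_version("3.1.1", "2.0.0")  # passes silently

try:
    check_deequ_version("3.1.1", "1.0.3")  # the combination that failed here
except ValueError as exc:
    print(exc)
```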

fedonev