Questions tagged [pydeequ]

PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

9 questions
2
votes
1 answer

Error importing PyDeequ package on databricks

I want to run some data quality tests, and for that I intend to use PyDeequ in a Databricks notebook. Keep in mind that I'm very new to Databricks and Spark. First, I created a cluster with the Runtime version "10.4 LTS (includes Apache…
1
vote
1 answer

Pydeequ throwing Py4JJavaError

I have the following installation of PyDeequ: in an Anaconda environment, I have installed pyspark 3.0.0 and the latest releases of pydeequ and sagemaker_pyspark. from pyspark.sql import SparkSession import os os.environ["SPARK_VERSION"] =…
Norhther
1
vote
1 answer

closing pydeequ callback server

I'm using pydeequ with Spark 3.0.1 to perform some constraint checks on data. When testing with the VerificationSuite, after calling VerificationResult.checkResultsAsDataFrame(spark, result), it seems that the callback server which gets started by…
dataviews
0
votes
0 answers

Error using PyDeequ Profile in Databricks

I am new to Python, Databricks, and pydeequ. I'm trying to use pydeequ in Databricks. I installed the library via Maven using "com.amazon.deequ:deequ:2.0.4-spark-3.3". The analyzers are working, but not the profilerunner. I am trying to run this…
0
votes
0 answers

Pydeequ satisfy custom expression

Most of the checks in the examples or docs involve just two columns and simple, strongly typed functions (isGreaterThanOrEqualTo, etc.). Is there a way to introduce checks like columnA + columnB <= columnC - columnD, and so on? Any way to add a lambda…
trequartista
0
votes
0 answers

How do I import Pydeequ on Glue jupyter notebooks?

I have been trying to import PyDeequ to develop tests in AWS Glue's notebook environment. I have downloaded the pydeequ.zip file and the JAR file (deequ-2.0.0-spark-3.1.jar). Both of them are in an S3 bucket. I am using Glue 3.0 which…
Jonathan
0
votes
1 answer

Error importing PyDeequ package on Glue 3.0

I am trying to import the pydeequ library in an AWS environment while building a job with Glue. I put a zip file of pydeequ in the Python library path and the JAR file in the Dependent JARs path. My script is the following: import sys from awsglue.transforms import * from…
0
votes
1 answer

How to set dynamic assert conditions for deequ verification checks in scala

I am using the Deequ VerificationSuite to validate my SQL tables, but I am unable to implement dynamic assert conditions for checks: val verificationResult: VerificationResult = { VerificationSuite() .onData(dataset) .addCheck( …
vibhor Gupta
0
votes
1 answer

Validation using pydeequ within a Glue job will prevent the job from completing

I am attempting to use the AWS Big Data Blog article to create a job in AWS Glue Studio and use pydeequ to validate the data. I was successful in running pydeequ in the job, but when using some of the Check methods, the job kept running even after…
trgs