Questions tagged [amazon-deequ]

57 questions
8
votes
1 answer

Uniqueness check in Deequ

I'm currently exploring the Deequ library and I'm trying to understand whether it's possible to check the uniqueness of a combination of columns. This code .hasUniqueness(Seq("col1", "col2"), Check.IsOne)) seems to calculate uniqueness for each…
Dawid
  • 652
  • 1
  • 11
  • 24
5
votes
2 answers

Data testing framework for data streaming (deequ vs Great Expectations)

I want to introduce data quality testing (empty fields, max/min values, regex, etc.) into my pipeline, which will essentially consume Kafka topics, testing the data before it is logged into the DB. I am having a hard time choosing between the Deequ and…
Andy MGF
  • 133
  • 1
  • 7
3
votes
0 answers

Pydeequ - Datatype check for datetime column

We are implementing a PyDeequ process for certain validations, including a data type check (using the verification suite). However, based on our understanding, the hasDataType function can be used only for validating against the following - any thoughts…
Murugan S
  • 47
  • 4
3
votes
2 answers

how to run all suggested checks in pydeequ

I have just started with pydeequ and I want to create checks for a Spark DataFrame that has ~1800 features. To find out which checks I must perform, I do the following: suggestionResult = ConstraintSuggestionRunner(spark) \ .onData(df) \ …
Shoaibkhanz
  • 1,942
  • 3
  • 24
  • 41
2
votes
0 answers

deequ rule to check for gap in numeric sequence

With Deequ, is there a way of checking that a sequence has no gaps in it? Similar to this SQL
oluies
  • 17,694
  • 14
  • 74
  • 117
2
votes
1 answer

How to Store Failed Status Records of Amazon Deequ in a Separate Spark DataFrame

I have a requirement to run data quality tests, so I am using Amazon Deequ for this. I am able to find the data quality success/failure status using the code below, but next I want to get all the rows which failed the check and store them in another…
2
votes
1 answer

How to check if values of a DateType column are within a range of specified dates?

So, I'm using Amazon Deequ in Spark, and I have a dataframe df with a column publish_date which is of type DateType. I simply want to check the following: publish_date <= current_date(minus)x AND publish_date >= current_date(minus)y where x and y…
2
votes
0 answers

Parsing Deequ Rules from a csv/table dynamically

I'm using the Amazon Deequ library and trying to pass the rules from a CSV file or a MySQL table. My CSV file will have a column with values like…
Riyan Mohammed
  • 247
  • 2
  • 6
  • 20
2
votes
2 answers

Load constraints from csv-file (amazon deequ)

I'm checking out Deequ, which seems like a really nice library. I was wondering if it is possible to load constraints from a CSV file or an ORC table in HDFS? Let's say I have a table with these types: case class Item( id: Long, productName:…
1
vote
1 answer

Pydeequ throwing Py4JJavaError

I have the following installation of Pydeequ: in an Anaconda environment, I have installed pyspark 3.0.0 and the latest releases of pydeequ and sagemaker_pyspark. from pyspark.sql import SparkSession import os os.environ["SPARK_VERSION"] =…
Norhther
  • 545
  • 3
  • 15
  • 35
1
vote
1 answer

closing pydeequ callback server

I'm using pydeequ with Spark 3.0.1 to perform some constraint checks on data. As for testing with the VerificationSuite, after calling VerificationResult.checkResultsAsDataFrame(spark, result), it seems that the callback server which gets started by…
dataviews
  • 2,466
  • 7
  • 31
  • 64
1
vote
0 answers

Data Quality Framework in AWS

I am trying to implement a data quality framework for an application which ingests data from various systems (batch, near real time, real time). A few items that I want to highlight here are: The data pipelines vary widely and ingest very high volumes…
1
vote
1 answer

PyDeequ hasPattern fails with 'PatternMatch' object has no attribute '_Check'

I'm trying to run the sample code for the pattern check "hasPattern()" with PyDeequ and it fails with an exception. The code: import pydeequ from pyspark.sql import SparkSession, Row spark = (SparkSession .builder …
1
vote
0 answers

AWS Deequ Checks Error: isGreaterThanOrEqualTo is not a member of com.amazon.deequ.VerificationRunBuilder

I ran the following command in a Databricks notebook with the com.amazon.deequ:deequ:2.0.0-spark-3.1 library to check data quality on input data, and I got error messages saying that certain functions are not a member of com.amazon.deequ.VerificationRunBuilder. Where…
fullysane
  • 51
  • 1
1
vote
1 answer

Type arguments do not conform to trait type parameter bounds

I am using a library which is written by Amazon in Scala, here. The trait goes like this: trait Analyzer[S <: State[_], +M <: Metric[_]] I am trying to make a case object to store some information, and an instance of the above Analyzer is a part of…
Shiv
  • 105
  • 7