
Is there a way to compare two values of type double in PySpark, with a specified margin of error? Essentially similar to this post, but in PySpark.

Something like:

df = ...  # some DataFrame with two double columns, RESULT1 and RESULT2

df = df.withColumn('compare', when(col('RESULT1') == col('RESULT2') +/- 0.05 * col('RESULT2'), lit("match")).otherwise(lit("no match")))

But in a more elegant way?


2 Answers


You can use between as the condition:

from pyspark.sql.functions import when, col, lit

df2 = df.withColumn(
    'compare',
    when(
        # RESULT1 must fall within 5% of RESULT2 on either side (bounds are inclusive)
        col('RESULT1').between(0.95 * col('RESULT2'), 1.05 * col('RESULT2')),
        lit("match")
    ).otherwise(
        lit("no match")
    )
)
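For illustration, here is a minimal self-contained sketch of how this behaves on a toy DataFrame (the sample values are made up; only the third row falls outside the 5% band):

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: RESULT1 is compared against a 5% band around RESULT2.
df = spark.createDataFrame(
    [(100.0, 98.0), (100.0, 104.0), (100.0, 90.0)],
    ['RESULT1', 'RESULT2']
)

df2 = df.withColumn(
    'compare',
    when(
        col('RESULT1').between(0.95 * col('RESULT2'), 1.05 * col('RESULT2')),
        lit("match")
    ).otherwise(lit("no match"))
)
df2.show()
# Expected: the first two rows are "match", the last row is "no match".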

You can also write the condition as |RESULT1 - RESULT2| <= 0.05 * RESULT2:

from pyspark.sql import functions as F

df1 = df.withColumn(
    'compare',
    F.when(
        F.abs(F.col('RESULT1') - F.col("RESULT2")) <= 0.05 * F.col("RESULT2"),
        F.lit("match")
    ).otherwise(F.lit("no match"))
)
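Note that both conditions implicitly assume RESULT2 is non-negative: with a negative RESULT2, 0.05 * RESULT2 is negative and nothing can match. If that matters for your data, one possible variant (my own sketch, not part of the original answer) is to take the absolute value of RESULT2 in the tolerance as well:

from pyspark.sql import functions as F

df1 = df.withColumn(
    'compare',
    F.when(
        # 5% tolerance relative to |RESULT2|, so the tolerance stays positive
        # even when RESULT2 is negative.
        F.abs(F.col('RESULT1') - F.col('RESULT2')) <= 0.05 * F.abs(F.col('RESULT2')),
        F.lit("match")
    ).otherwise(F.lit("no match"))
)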