
I am filtering a DataFrame, and when I pass an integer value in the condition, the filter only keeps rows whose column value, truncated to an integer, satisfies the condition. Why is this happening? See the screenshot below: the two filters give different results. I am using Spark 2.2. I tested it with Python 2.6 and Python 3.5; the results are the same.

[screenshot: the two filters, `lat > 60` and `lat > 60.0`, returning different results]
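A minimal reproduction of what I am seeing (toy values I made up for illustration):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: 60.5 sits between the integer and double thresholds.
df = spark.createDataFrame([Row(lat=59.5), Row(lat=60.5), Row(lat=61.5)])

df.filter('lat > 60').show()    # on Spark 2.2: only lat=61.5 comes back
df.filter('lat > 60.0').show()  # lat=60.5 and lat=61.5, as expected
```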

Update

I tried it with Spark SQL. If I do not cast the field to double, it gives the same answer as the first filter above. However, if I cast the column to double before filtering, it gives the correct answer.

[screenshot: the Spark SQL queries, with and without `CAST(lat AS DOUBLE)`]
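Roughly what I ran (the view name `points` is just for illustration):

```python
df.createOrReplaceTempView('points')

# Without the cast: same (wrong) result as the first DataFrame filter above.
spark.sql('SELECT * FROM points WHERE lat > 60').show()

# Casting the column to double first gives the correct answer.
spark.sql('SELECT * FROM points WHERE CAST(lat AS DOUBLE) > 60').show()
```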

Fisseha Berhane
  • Does it happen when you use `60L`? – Bala Mar 23 '18 at 15:52
  • Yes, using `60L` does not solve it. In Python 2 it gives the same answer; in Python 3 it gives a SyntaxError. – Fisseha Berhane Mar 23 '18 at 15:58
  • 1
    Firstly [don't post pictures of code](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question). Second, please provide an [mcve] so we can try to recreate your issue. More on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Mar 23 '18 at 16:17
  • In Spark 1.6 it works as expected. – Bala Mar 23 '18 at 16:19

2 Answers


For `lat > 60`:

Given a double and an integer, Spark implicitly converts both of them to integers. The result is consistent with that: the filter returns latitudes >= 61.

For `lat > cast(60 as double)` or `lat > 60.0`:

Given two doubles, Spark returns everything in (60.0, Infinity), as expected.

This might be slightly unintuitive, but you must remember that Spark performs implicit conversions between `IntegerType()` and `DoubleType()`.
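A plain-Python sketch of the arithmetic (not Spark's actual code path; Python's `int()` truncates toward zero just like the cast does):

```python
lat = 60.5

# Double vs. double, the comparison the user intended:
print(lat > 60.0)     # True

# What the predicate effectively becomes after the implicit cast to int:
print(int(lat) > 60)  # False: int(60.5) truncates to 60, and 60 > 60 is False
```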

Steven Black
  • FWIW, Spark makes a lot of implicit conversions/casts when comparing values. It becomes very surprising when comparing integers and strings. – Dror May 10 '23 at 08:22

Although you use PySpark, under the hood it runs on Scala and ultimately Java, so Java's conversion rules apply here.

To be specific, see [JLS §5.1.3, Narrowing Primitive Conversion](https://docs.oracle.com/javase/specs/jls/se10/html/jls-5.html#jls-5.1.3):

> ...Otherwise, if the floating-point number is not an infinity, the floating-point value is rounded to an integer value V, rounding toward zero using IEEE 754 round-toward-zero mode (§4.2.3).
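For illustration, `math.trunc` in Python implements the same round-toward-zero rule:

```python
import math

print(math.trunc(60.9))   # 60
print(math.trunc(-60.9))  # -60: toward zero, not toward negative infinity
print(math.floor(-60.9))  # -61: floor, shown for contrast
```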

shuaiyuancn