
I'm trying to perform some exploratory data analysis by summarizing the distribution of measurements in my dataset with the PySpark `describe()` function. However, for measurements whose values are all negative, the min and max appear to be swapped.

chicago_crime.describe('latitude', 'longitude').show()

+-------+-------------------+--------------------+
|summary|           latitude|           longitude|
+-------+-------------------+--------------------+
|  count|            6811141|             6811141|
|   mean|  41.84203025139101|  -87.67177837500668|
| stddev|0.08994460772003067|0.062086304377221284|
|    min|       36.619446395|       -87.524529378|
|    max|       42.022910333|       -91.686565684|
+-------+-------------------+--------------------+

The longitude values are all negative. I expected the min for longitude to be -91.686565684 and the max to be -87.524529378.

Has anyone else noticed this error? Can the PySpark developers correct it?

As requested in the comments below, here is the printSchema() output.

chicago_crime.printSchema()

root
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)

Converting the columns to float then gives the expected result.

chicago_crime = chicago_crime.withColumn('latitude', chicago_crime.latitude.astype('float'))
chicago_crime = chicago_crime.withColumn('longitude', chicago_crime.longitude.astype('float'))

chicago_crime.describe('latitude', 'longitude').show()

+-------+-------------------+--------------------+
|summary|           latitude|           longitude|
+-------+-------------------+--------------------+
|  count|            6810978|             6810978|
|   mean|  41.84215369600549|   -87.6716834892099|
| stddev|0.08628712634075986|0.058938763393995654|
|    min|          41.644585|          -87.934326|
|    max|           42.02291|           -87.52453|
+-------+-------------------+--------------------+
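
For reference, the same conversion can be written with cast() instead of astype() (a sketch assuming the same chicago_crime DataFrame and column names; casting to double rather than float preserves more of the original precision):

from pyspark.sql.functions import col

# cast both coordinate columns from string to double, then summarize again
for c in ['latitude', 'longitude']:
    chicago_crime = chicago_crime.withColumn(c, col(c).cast('double'))

chicago_crime.describe('latitude', 'longitude').show()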
  • Almost surely the issue is that your column is of `StringType()` and not numeric, so the sort is happening lexicographically. Show the output of `chicago_crime.printSchema()` to check. – pault Aug 15 '19 at 14:30
  • Please see the output above. You are correct. – Tom Weichle Aug 15 '19 at 18:42
  • Possible duplicate of [how to change a Dataframe column from String type to Double type in pyspark](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark) – pault Aug 15 '19 at 18:55
  • Convert your columns to double type, then compute the statistics. – pault Aug 15 '19 at 18:55
  • @pault I have converted to float above and computed the statistics. The results were as expected. Thanks for your help! – Tom Weichle Aug 15 '19 at 23:17

1 Answer


I tried the code below:

from pyspark.sql import Row
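# build a one-column DataFrame of negative integers; toDF() names the column _1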
df = spark.sparkContext.parallelize([Row(-1), Row(-2), Row(-3)]).toDF()
df.describe().show()

I got the expected result, as shown below:

+-------+----+
|summary|  _1|
+-------+----+
|  count|   3|
|   mean|-2.0|
| stddev| 1.0|
|    min|  -3|
|    max|  -1|
+-------+----+
  • That's nice that it works for you, but this doesn't answer the OP's question, outside of showing that `describe()` is able to properly sort negative numbers, which is not a particularly surprising result. – pault Aug 15 '19 at 14:31