
I'm trying to perform some exploratory data analysis by summarizing the distribution of measurements in my dataset with the PySpark `describe()` function. However, for measurements whose values are all negative, the min and max appear to be swapped.

chicago_crime.describe('latitude', 'longitude').show()

+-------+-------------------+--------------------+
|summary|           latitude|           longitude|
+-------+-------------------+--------------------+
|  count|            6811141|             6811141|
|   mean|  41.84203025139101|  -87.67177837500668|
| stddev|0.08994460772003067|0.062086304377221284|
|    min|       36.619446395|       -87.524529378|
|    max|       42.022910333|       -91.686565684|
+-------+-------------------+--------------------+

The longitude values are all negative. I expected the min for longitude to be -91.686565684 and the max to be -87.524529378.

Has anyone else noticed this error? Can the PySpark developers correct it?

As requested in the comments below, here is the printSchema() output.

chicago_crime.printSchema()

root
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)

Converting the columns to float then gives the expected result.

chicago_crime = chicago_crime.withColumn('latitude', chicago_crime.latitude.astype('float'))
chicago_crime = chicago_crime.withColumn('longitude', chicago_crime.longitude.astype('float'))

chicago_crime.describe('latitude', 'longitude').show()

+-------+-------------------+--------------------+
|summary|           latitude|           longitude|
+-------+-------------------+--------------------+
|  count|            6810978|             6810978|
|   mean|  41.84215369600549|   -87.6716834892099|
| stddev|0.08628712634075986|0.058938763393995654|
|    min|          41.644585|          -87.934326|
|    max|           42.02291|           -87.52453|
+-------+-------------------+--------------------+
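
For reference, the same conversion can be written with cast() instead of astype() (a sketch assuming the same chicago_crime DataFrame and column names; casting to double rather than float preserves more of the original precision):

from pyspark.sql.functions import col

# cast both coordinate columns from string to double, then summarize again
for c in ['latitude', 'longitude']:
    chicago_crime = chicago_crime.withColumn(c, col(c).cast('double'))

chicago_crime.describe('latitude', 'longitude').show()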
  • Almost surely the issue is that your column is of `StringType()` and not numeric, so the sort is happening lexicographically. Show the output of `chicago_crime.printSchema()` to check. – pault Aug 15 '19 at 14:30
  • Please see the output above. You are correct. – Tom Weichle Aug 15 '19 at 18:42
  • Possible duplicate of [how to change a Dataframe column from String type to Double type in pyspark](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark) – pault Aug 15 '19 at 18:55
  • Convert your columns to double type, then compute the statistics. – pault Aug 15 '19 at 18:55
  • @pault I have converted to float above and computed the statistics. The results were as expected. Thanks for your help! – Tom Weichle Aug 15 '19 at 23:17

1 Answer


I tried the code below:

from pyspark.sql import Row
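# build a one-column DataFrame of negative integers; toDF() names the column _1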
df = spark.sparkContext.parallelize([Row(-1), Row(-2), Row(-3)]).toDF()
df.describe().show()

I got the expected result, as shown below:

+-------+----+
|summary|  _1|
+-------+----+
|  count|   3|
|   mean|-2.0|
| stddev| 1.0|
|    min|  -3|
|    max|  -1|
+-------+----+
  • That's nice that it works for you, but this doesn't answer the OP's question, outside of showing that `describe()` is able to properly sort negative numbers, which is not a particularly surprising result. – pault Aug 15 '19 at 14:31