I'm trying to perform some exploratory data analysis by summarizing the distribution of measurements in my dataset with the PySpark describe() function. However, for measurements whose values are all negative, the min and max appear to be swapped.
chicago_crime.describe('latitude', 'longitude').show()
+-------+-------------------+--------------------+
|summary| latitude| longitude|
+-------+-------------------+--------------------+
| count| 6811141| 6811141|
| mean| 41.84203025139101| -87.67177837500668|
| stddev|0.08994460772003067|0.062086304377221284|
| min| 36.619446395| -87.524529378|
| max| 42.022910333| -91.686565684|
+-------+-------------------+--------------------+
The longitude values are all negative. I expected the min for longitude to be -91.686565684 and the max to be -87.524529378. Has anyone else noticed this? Is it an error the PySpark developers should correct?
As per the request below, here is the printSchema() output.
chicago_crime.printSchema()
root
|-- latitude: string (nullable = true)
|-- longitude: string (nullable = true)
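Since both columns are strings, I suspect describe() is taking the min and max lexicographically rather than numerically, which would explain the apparent flip. A quick check of plain Python string comparison (my own sketch, not Spark itself) seems consistent with the output above:

vals = ['-87.524529378', '-91.686565684']
# String comparison goes character by character, and '8' < '9',
# so '-87...' sorts before '-91...'
print(min(vals))  # -87.524529378
print(max(vals))  # -91.686565684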
Converting the columns to float then shows the expected result.
chicago_crime = chicago_crime.withColumn('latitude', chicago_crime.latitude.astype('float'))
chicago_crime = chicago_crime.withColumn('longitude', chicago_crime.longitude.astype('float'))
chicago_crime.describe('latitude', 'longitude').show()
+-------+-------------------+--------------------+
|summary| latitude| longitude|
+-------+-------------------+--------------------+
| count| 6810978| 6810978|
| mean| 41.84215369600549| -87.6716834892099|
| stddev|0.08628712634075986|0.058938763393995654|
| min| 41.644585| -87.934326|
| max| 42.02291| -87.52453|
+-------+-------------------+--------------------+
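One thing I noticed is that the float cast truncates the values (for example 42.022910333 becomes 42.02291). If I instead cast the original string columns to double rather than float, I expect the full precision to be kept while still giving the correct min/max ordering; roughly:

from pyspark.sql.functions import col

# Assumes chicago_crime still has the original string-typed columns;
# double is 64-bit, so the original digits should survive the cast
chicago_crime = chicago_crime.withColumn('latitude', col('latitude').cast('double')) \
                             .withColumn('longitude', col('longitude').cast('double'))
chicago_crime.describe('latitude', 'longitude').show()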