I have the following data (showing only a snippet):
DEST_COUNTRY_NAME    ORIGIN_COUNTRY_NAME    count
United States        Romania                15
United States        Croatia                1
United States        Ireland                344
Egypt                United States          15
I read it with the inferSchema option set to true and then describe the columns. It seems to work fine.
scala> val data = spark.read.option("header", "true").option("inferSchema","true").csv("./data/flight-data/csv/2015-summary.csv")
scala> data.describe().show()
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-------+-----------------+-------------------+------------------+
| count| 256| 256| 256|
| mean| null| null| 1770.765625|
| stddev| null| null|23126.516918551915|
| min| Algeria| Angola| 1|
| max| Zambia| Vietnam| 370002|
+-------+-----------------+-------------------+------------------+
If I don't specify inferSchema, then all the columns are treated as strings.
scala> val dataNoSchema = spark.read.option("header", "true").csv("./data/flight-data/csv/2015-summary.csv")
dataNoSchema: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> dataNoSchema.printSchema
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: string (nullable = true)
Question 1) Why then does Spark give mean and stddev values for the last column, count?
scala> dataNoSchema.describe().show();
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-------+-----------------+-------------------+------------------+
| count| 256| 256| 256|
| mean| null| null| 1770.765625|
| stddev| null| null|23126.516918551915|
| min| Algeria| Angola| 1|
| max| Zambia| Vietnam| 986|
+-------+-----------------+-------------------+------------------+
Question 2) If Spark now interprets count as a numeric column, then why is the max value 986 and not 370002 (as it is in the data DataFrame)?
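
One thing that may be relevant to Question 2: if min and max are still computed on the raw string values, the ordering would be lexicographic rather than numeric. Below is a minimal plain-Scala sketch of that difference. It uses a handful of sample values (an assumption for illustration, not the full file) and does not confirm what describe() does internally.

```scala
// Sample values standing in for the string-typed count column
// (an assumption for illustration, not the contents of the real file).
val counts = Seq("15", "1", "344", "986", "370002")

// Compared as strings, ordering is lexicographic:
// "986" sorts after "370002" because '9' > '3'.
val maxAsString = counts.max

// Compared as numbers after an explicit conversion, the largest value wins.
val maxAsInt = counts.map(_.toInt).max

println(s"string max = $maxAsString, numeric max = $maxAsInt")
```

If that is what is happening, casting the column to a numeric type before calling describe() should presumably give the same max as the inferSchema version.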