
I have the following data (just showing a snippet):

DEST_COUNTRY_NAME   ORIGIN_COUNTRY_NAME count
United States   Romania 15
United States   Croatia 1
United States   Ireland 344
Egypt   United States   15

I read it with the inferSchema option set to true and then describe the columns. It seems to work fine.

scala> val data = spark.read.option("header", "true").option("inferSchema","true").csv("./data/flight-data/csv/2015-summary.csv")
scala> data.describe().show()
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|            370002|
+-------+-----------------+-------------------+------------------+
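As a quick check of what inference actually produced (a minimal sketch; the output is not part of the original post, but the count column should come back as a numeric type here):

    // With inferSchema enabled, count should be reported as a numeric type,
    // unlike the all-string schema shown below.
    data.printSchema()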

If I don't specify inferSchema, then all the columns are treated as string.

scala> val dataNoSchema = spark.read.option("header", "true").csv("./data/flight-data/csv/2015-summary.csv")
dataNoSchema: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> dataNoSchema.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)
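For comparison, a minimal sketch (not part of the original post) of supplying an explicit schema instead of relying on inference; the file path and column names are taken from the snippet above:

    import org.apache.spark.sql.types._

    // Hypothetical explicit schema for the flight-data file; with it, count is read
    // as an integer directly, without an inference pass over the data.
    val flightSchema = StructType(Seq(
      StructField("DEST_COUNTRY_NAME", StringType, nullable = true),
      StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
      StructField("count", IntegerType, nullable = true)
    ))

    val dataExplicit = spark.read
      .option("header", "true")
      .schema(flightSchema)
      .csv("./data/flight-data/csv/2015-summary.csv")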

Question 1) Why does Spark then give mean and stddev values for the last column, count?

scala> dataNoSchema.describe().show();
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|               986|
+-------+-----------------+-------------------+------------------+

Question 2) If Spark now interprets count as a numeric column, then why is the max value 986 and not 370002 (as it is in the data DataFrame)?

Manu Chadha

1 Answer


Spark SQL aspires to be SQL-standard compliant and therefore uses the same evaluation rules, transparently coercing types to satisfy an expression when needed (see for example my answer to PySpark DataFrames - filtering using comparisons between columns of different types).
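For illustration, a minimal sketch of that coercion (assuming the dataNoSchema DataFrame from the question): comparing the string column count to a numeric literal does not fail, because the analyzer inserts a cast so the comparison can be evaluated.

    import org.apache.spark.sql.functions.col

    // The predicate type-checks even though count is a string column, because
    // Spark transparently casts the operands; the inserted cast shows up in the plan.
    dataNoSchema.filter(col("count") > 100).explain()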

This means that the max and mean / stddev cases are simply not equivalent (see the sketch after this list):

  • maximum is meaningful for strings (with lexicographic ordering) and no coercion is required

    Seq.empty[String].toDF("count").agg(max("count")).explain
    
    == Physical Plan ==
    SortAggregate(key=[], functions=[max(count#69)])
    +- Exchange SinglePartition
       +- SortAggregate(key=[], functions=[partial_max(count#69)])
          +- LocalTableScan <empty>, [count#69]
    
  • average or standard deviation are not, and the argument is cast to double

    Seq.empty[String].toDF("count").agg(mean("count")).explain
    
    == Physical Plan ==
    *(2) HashAggregate(keys=[], functions=[avg(cast(count#81 as double))])
    +- Exchange SinglePartition
       +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(count#81 as double))])
          +- LocalTableScan <empty>, [count#81]
    
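In concrete terms (a minimal sketch, assuming the dataNoSchema DataFrame from the question): under lexicographic ordering "986" sorts after "370002", which is why the string max is 986, while an explicit cast recovers the numeric maximum.

    import org.apache.spark.sql.functions.{col, max}

    // Lexicographic comparison: '9' > '3', so "986" > "370002" as strings.
    "986" > "370002"                                   // true

    // max over the raw string column therefore returns "986" ...
    dataNoSchema.agg(max(col("count"))).show()

    // ... while casting count to a numeric type first yields 370002.
    dataNoSchema.agg(max(col("count").cast("int"))).show()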
10465355
  • It is not. I am saying that `max` with a `String` argument uses [lexicographical order](https://en.wikipedia.org/wiki/Lexicographical_order), because such an operation is defined and doesn't require casting. – 10465355 Feb 08 '19 at 00:00