
I am a beginner who has just started using Spark. I executed the following in PySpark (Scala 2.11.8):

dic = [{"a": 1}, {"b": 2}, {"c": 3}]

df = spark.sparkContext.parallelize(dic).toDF()
df.show()

which then produces:

+----+                                                                          
|   a|
+----+
|   1|
|null|
|null|
+----+

Whereas when I execute spark.createDataFrame(dic).show() it produces:

+----+----+----+                                                                
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+

Based on "Unable to use rdd.toDF() but spark.createDataFrame(rdd) Works", it seems that toDF() is syntactic sugar for createDataFrame, but the post doesn't elaborate on what happens internally that causes the difference. Could anyone kindly explain the reason behind the result above?

Thanks!

Joel Lee

1 Answer


First of all - if you check the log you'll see the following warning:

UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead

It is there for a reason.

The explanation for the observed behavior is simple: schema inference logic differs between a local collection (where we can safely assume that all records can be scanned in negligible time) and an RDD (where this assumption is not necessarily true).

The latter uses _inferSchema, which samples the data for inference; if no sampling ratio is provided, it inspects only the first row. With a local collection, in contrast, Spark scans all records.

The take-away message here is to read the warnings and not to depend on schema inference (which is often unreliable and expensive).

Alper t. Turker