I am a beginner who has just started using Spark. I executed the following in PySpark (Scala 2.11.8):
dic = [{"a":1},{"b":2},{"c":3}]
df = spark.sparkContext.parallelize(dic).toDF()
df.show()
Which then produces:
+----+
|   a|
+----+
|   1|
|null|
|null|
+----+
Whereas when I execute spark.createDataFrame(dic).show(),
it produces:
+----+----+----+
|   a|   b|   c|
+----+----+----+
|   1|null|null|
|null|   2|null|
|null|null|   3|
+----+----+----+
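To make the difference concrete, here is a plain-Python sketch that reproduces the shape of the two outputs. It is only my guess at what might be happening (column names taken from the first row in one case, from every row in the other). The helper names are hypothetical and this is not Spark's actual code.

```python
# Hypothetical helpers: a plain-Python imitation of the two outputs above,
# NOT Spark's real implementation.

def columns_from_first_row(rows):
    """Take column names from the first row only (assumed toDF() behaviour)."""
    return sorted(rows[0].keys())

def columns_from_all_rows(rows):
    """Take the union of column names over every row (assumed createDataFrame behaviour)."""
    names = set()
    for row in rows:
        names.update(row.keys())
    return sorted(names)

def to_table(rows, columns):
    """Build one list per row, using None wherever a row lacks a column."""
    return [[row.get(col) for col in columns] for row in rows]

dic = [{"a": 1}, {"b": 2}, {"c": 3}]

first_only = to_table(dic, columns_from_first_row(dic))
all_rows = to_table(dic, columns_from_all_rows(dic))

print(first_only)  # [[1], [None], [None]]
print(all_rows)    # [[1, None, None], [None, 2, None], [None, None, 3]]
```

If this guess is right, the first table loses b and c because they never appear in the first row, but I have not confirmed that this is what Spark does.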
Based on the post "Unable to use rdd.toDF() but spark.createDataFrame(rdd) Works", it seems that toDF() is syntactic sugar for createDataFrame, but that post doesn't elaborate on what happens internally to cause the difference. Could anyone explain the reason behind the results above?
Thanks!