I have been preparing for the CCA175 exam and ran into this issue. I was trying to convert an RDD to a DataFrame, and here is my code:
li=[1]
rdd=sc.parallelize(li)
df=rdd.toDF()
This raises the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 58, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 746, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 390, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio, names=schema)
File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 370, in _inferSchema
schema = _infer_schema(first, names=names)
File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\types.py", line 1062, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
If I convert the list of integers to a list of tuples of integers, it throws the same error. But if I create the list as follows, it works:
>>> li=[(1,)]
>>> rdd=sc.parallelize(li)
>>> df=rdd.toDF(schema=['col'])
>>> df.show()
+---+
|col|
+---+
|  1|
+---+
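Judging by the last frame of the traceback, the error is raised by a check on the type of each element in `_infer_schema`. Below is my rough, pure-Python sketch of what such a check might look like. It is an assumption based on the traceback and on the behaviour above, not the actual PySpark source, and `infer_field_names` is a hypothetical name I made up for illustration:

```python
def infer_field_names(row, names=None):
    """Hypothetical, simplified stand-in for a row-type check (not PySpark code)."""
    if isinstance(row, dict):
        # A dict can supply field names through its keys.
        return sorted(row.keys())
    if isinstance(row, (tuple, list)):
        # A tuple or list has one field per position; pad missing names as _1, _2, ...
        names = list(names) if names is not None else []
        names.extend("_%d" % i for i in range(len(names) + 1, len(row) + 1))
        return names[:len(row)]
    if hasattr(row, "__dict__"):
        # A plain object can supply field names through its attributes.
        return sorted(row.__dict__.keys())
    # A bare int (or str, float, ...) has no fields to name, so it is rejected.
    raise TypeError("Can not infer schema for type: %s" % type(row))


if __name__ == "__main__":
    print(infer_field_names((1,), names=["col"]))
    try:
        infer_field_names(1)
    except TypeError as exc:
        print(exc)
```

In this sketch a one-element tuple is accepted as a row with a single field, while a bare integer falls through to the `TypeError`, which would match what I see above if the real check works along these lines.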
I am trying to understand why this happens. Can anyone please explain?