Why does PySpark convert li=[(1,)] to a DataFrame but not li=[(1)]? (RDD to DataFrame conversion)

Question

I have been preparing for the CCA175 exam and ran into this issue. I was trying to convert an RDD to a DataFrame, and here is my code:

li=[1]
rdd=sc.parallelize(li)
df=rdd.toDF()

This raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 58, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 746, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 390, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\session.py", line 370, in _inferSchema
    schema = _infer_schema(first, names=names)
  File "C:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\types.py", line 1062, in _infer_schema
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>

If I try to convert the list of integers to a list of tuples of integers, it throws the same error. But if I create the list as follows, it works:

>>> li=[(1,)]
>>> rdd=sc.parallelize(li)
>>> df=rdd.toDF(schema=['col'])
>>> df.show()
+---+
|col|
+---+
|  1|
+---+

I am trying to understand why this happens. Can anyone please explain?

A list of tuples/lists is needed to create a dataframe. A list of ints/floats does not work. — mck, Mar 03 '21 at 12:05
mck, thanks for the reply. If that is the case, li=[(1)] should also work, right? It doesn't. — sk79, Mar 03 '21 at 13:09
No, it will not. (1) is not a tuple. It will be evaluated to 1 by Python. You need to add a comma to indicate that it is a tuple with 1 element. — mck, Mar 03 '21 at 13:10
See [this](https://stackoverflow.com/questions/12876177/how-to-create-a-tuple-with-only-one-element) to learn more. — mck, Mar 03 '21 at 13:11
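
The point made in the comments can be checked in plain Python, without Spark. The snippet below is only a minimal sketch: the names `a`, `b`, `values`, and `rows` are made up for illustration, and the Spark call in the final comment assumes a live SparkContext `sc`, as in the question.

```python
# Parentheses alone only group an expression; the trailing comma makes a tuple.
a = (1)
b = (1,)
print(type(a).__name__)  # int
print(type(b).__name__)  # tuple

# So [(1)] is the same list as [1], and both hit the same schema-inference error.
print([(1)] == [1])  # True

# Wrapping each bare value in a 1-tuple gives row-like elements.
values = [1, 2, 3]  # illustrative data, not from the question
rows = [(v,) for v in values]
print(rows)  # [(1,), (2,), (3,)]

# With a live SparkContext `sc` (assumed here, not created), the same wrapping
# could be done on the RDD before converting it:
#   sc.parallelize(values).map(lambda v: (v,)).toDF(["col"])
```

Each element of the RDD is treated as one row, so it has to be something with fields (a tuple, list, dict, or Row) rather than a bare int.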
