
I am working with Spark 2.1 in Python, and I am able to convert an RDD to a DataFrame using the toDF() method (spark is the SparkSession initialized earlier):

import sys
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# str.replace is a no-op when the pattern is absent, so the conditional is unnecessary
rdd = spark.read.text(sys.argv[1]).rdd.map(lambda l: l[0].replace("24:00", "00:00"))

fields = [StructField("datetime", StringType(), True),
          StructField("temperature", DecimalType(scale=3), True),
          StructField("humidity", DecimalType(scale=1), True)]

schema = StructType(fields)

df = (rdd.map(lambda k: k.split(","))
         .map(lambda p: (p[0][5:-3], Decimal(p[5]), Decimal(p[6])))
         .toDF(schema))

But I cannot find toDF() anywhere in the RDD API docs, so please help me understand why toDF() can be called on my RDD. Where is this method inherited from?

Yiannis
  • @zero323 I do not see how this is a duplicate? – Yiannis Jun 17 '17 at 16:07
  • You asked where the method comes from, so there is an exact explanation there. – zero323 Jun 17 '17 at 16:11
  • 1
    @zero323 you are right, makes sense now, however, where can I find that in the documentation? Thanks – Yiannis Jun 17 '17 at 16:31
  • Since the patch is applied at runtime, and only once there is an instance of `SparkSession` / `SQLContext`, it won't be included in the docs. But it is effectively equivalent to calling `SparkSession.createDataFrame`, so you can use its docs as a reference (see the sketch below). – zero323 Jun 17 '17 at 16:36
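
For context, here is a minimal sketch of how that runtime patching works. It mirrors the `_monkey_patch_RDD` helper in `pyspark/sql/session.py` (Spark 2.x); treat it as an illustration of the mechanism rather than the exact source:

from pyspark.rdd import RDD

def _monkey_patch_RDD(sparkSession):
    def toDF(self, schema=None, sampleRatio=None):
        # Delegates straight to the session, so rdd.toDF(schema)
        # is effectively sparkSession.createDataFrame(rdd, schema)
        return sparkSession.createDataFrame(self, schema, sampleRatio)

    RDD.toDF = toDF  # attached to the RDD class itself, so every RDD gains toDF()

Because the patch runs when the SparkSession is constructed, the method never appears in the RDD class documentation; the `SparkSession.createDataFrame` docs are the right reference.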
