
I am working with Spark 2.1 in Python, and I am able to convert an RDD to a DataFrame using the toDF() method (spark is the SparkSession initialized earlier):

import sys
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# str.replace is a no-op when the pattern is absent, so the conditional is unnecessary
rdd = spark.read.text(sys.argv[1]).rdd.map(lambda l: l[0].replace("24:00", "00:00"))

fields = [StructField("datetime", StringType(), True),
          StructField("temperature", DecimalType(scale=3), True),
          StructField("humidity", DecimalType(scale=1), True)]

schema = StructType(fields)

df = (rdd.map(lambda k: k.split(","))
         .map(lambda p: (p[0][5:-3], Decimal(p[5]), Decimal(p[6])))
         .toDF(schema))

But I cannot find toDF() anywhere in the RDD API docs, so please help me understand why toDF() can be called on my RDD. Where is this method inherited from?

Yiannis
  • @zero323 I do not see how this is a duplicate? – Yiannis Jun 17 '17 at 16:07
  • You asked where the method comes from, so there is an exact explanation there. – zero323 Jun 17 '17 at 16:11
  • 1
    @zero323 you are right, makes sense now, however, where can I find that in the documentation? Thanks – Yiannis Jun 17 '17 at 16:31
  • Since the patch is applied at runtime, and only once there is an instance of `SparkSession` / `SQLContext`, it won't be included in the docs. But it is effectively equivalent to calling `SparkSession.createDataFrame`, so you can use its docs as a reference (see the sketch below). – zero323 Jun 17 '17 at 16:36
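
For context, here is a minimal sketch of how that runtime patching works. It mirrors the `_monkey_patch_RDD` helper in `pyspark/sql/session.py` (Spark 2.x); treat it as an illustration of the mechanism rather than the exact source:

from pyspark.rdd import RDD

def _monkey_patch_RDD(sparkSession):
    def toDF(self, schema=None, sampleRatio=None):
        # Delegates straight to the session, so rdd.toDF(schema)
        # is effectively sparkSession.createDataFrame(rdd, schema)
        return sparkSession.createDataFrame(self, schema, sampleRatio)

    RDD.toDF = toDF  # attached to the RDD class itself, so every RDD gains toDF()

Because the patch runs when the SparkSession is constructed, the method never appears in the RDD class documentation; the `SparkSession.createDataFrame` docs are the right reference.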
