
I've converted RDDs to DataFrames using case classes, but my current data has 700 columns. I've come across mentions of using StructType, but I can't find an example. I hope someone can share one here. Thank you. Kevin

kevin0259

1 Answer


Here is an example using StructType, with this sample input in /tmp/test:

a,1,2.0
b,2,3.0

import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.DoubleType

// Define the schema explicitly, one StructField per column
def getSchema(): StructType = {
  val schema = new StructType(Array(
    StructField("col_a", StringType, nullable = true),
    StructField("col_b", IntegerType, nullable = true),
    StructField("col_c", DoubleType, nullable = true)
  ))
  schema
}

val sqlContext = new SQLContext(sc)
// Split each CSV line and build a Row with the types declared in the schema
val rdd = sc.textFile("/tmp/test").map(_.split(",", -1)).map(m => Row(m(0), m(1).toInt, m(2).toDouble))
val df = sqlContext.createDataFrame(rdd, getSchema())
df.show
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|    a|    1|  2.0|
|    b|    2|  3.0|
+-----+-----+-----+
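Since writing 700 `StructField`s by hand is impractical, the schema can also be built programmatically. This is a minimal sketch; the column names (`col_1` ... `col_700`) and the all-`StringType` fields are assumptions — in practice you would derive the names and types from your file's header line or a metadata source:

```scala
import org.apache.spark.sql.types.{StructField, StringType, StructType}

// Hypothetical column names; replace with names read from a header or metadata.
val columnNames = (1 to 700).map(i => s"col_$i")

// One StructField per column; here every column is a nullable string.
val wideSchema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
```

The resulting `wideSchema` can be passed to `createDataFrame` exactly like `getSchema()` above.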
m-bhole
  • This is exactly what the OP asked... not sure why people are downvoting. Please leave a comment with the reason for downvoting. – m-bhole Aug 13 '16 at 03:43
  • Thank you for your answer, hadooper. I don't know why people downvoted. There are many answers using case classes, but that approach is limited to ~20 columns. I couldn't find a clear example of using StructType, so thank you so much for taking the time to answer my question. - kevin – kevin0259 Aug 13 '16 at 21:07
  • Case classes are limited to 22 columns if you are using Scala 2.10. This limitation has been removed in Scala 2.11, so if you want to use case classes, you will have to use Scala 2.11. – m-bhole Aug 14 '16 at 04:50