
I've converted RDDs to DataFrames using case classes, but my current data has 700 columns. I've come across mentions of using StructType, but I can't find an example. I hope someone can share one here. Thank you. Kevin

kevin0259

1 Answer


Here is an example using StructType, with this sample input in /tmp/test:

a,1,2.0
b,2,3.0

import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.DoubleType

// Define the schema explicitly, one StructField per column
def getSchema(): StructType = {
  val schema = new StructType(Array(
    StructField("col_a", StringType, nullable = true),
    StructField("col_b", IntegerType, nullable = true),
    StructField("col_c", DoubleType, nullable = true)
  ))
  schema
}

val sqlContext = new SQLContext(sc)
// Split each CSV line and build a Row with the types declared in the schema
val rdd = sc.textFile("/tmp/test").map(_.split(",", -1)).map(m => Row(m(0), m(1).toInt, m(2).toDouble))
val df = sqlContext.createDataFrame(rdd, getSchema())
df.show
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|    a|    1|  2.0|
|    b|    2|  3.0|
+-----+-----+-----+
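Since writing 700 `StructField`s by hand is impractical, the schema can also be built programmatically. This is a minimal sketch; the column names (`col_1` ... `col_700`) and the all-`StringType` fields are assumptions — in practice you would derive the names and types from your file's header line or a metadata source:

```scala
import org.apache.spark.sql.types.{StructField, StringType, StructType}

// Hypothetical column names; replace with names read from a header or metadata.
val columnNames = (1 to 700).map(i => s"col_$i")

// One StructField per column; here every column is a nullable string.
val wideSchema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
```

The resulting `wideSchema` can be passed to `createDataFrame` exactly like `getSchema()` above.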
m-bhole
  • This is exactly what the OP asked... not sure why people are downvoting. Please leave a comment with the reason for downvoting. – m-bhole Aug 13 '16 at 03:43
  • Thank you for your answer, hadooper. I don't know why people downvoted. There are many answers using case classes, but that approach is limited to ~20 columns. I couldn't find a clear example of using StructType, so thank you so much for taking the time to answer my question. - kevin – kevin0259 Aug 13 '16 at 21:07
  • Case classes are limited to 22 columns if you are using Scala 2.10. This limitation has been removed in Scala 2.11, so if you want to use case classes, you will have to use Scala 2.11. – m-bhole Aug 14 '16 at 04:50