I have the following RDD and many just like it:
val csv = sc.parallelize(Array(
"col1, col2, col3",
"1, cat, dog",
"2, bird, bee"))
I would like to convert the RDD into a DataFrame whose schema is created dynamically/programmatically from the first row of the RDD.
I need to apply this logic to many similar RDDs, so I cannot hard-code the schema with a case class, nor can I use spark-csv
to load the data as a DataFrame from the start.
I've created a flattened DataFrame, but how do I break out the respective columns when creating the DataFrame?
Current code:
val header = csv.first()
val data = csv.mapPartitionsWithIndex {
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}.toDF(header).show()
Current output:
+----------------+
|col1, col2, col3|
+----------------+
| 1, cat, dog|
| 2, bird, bee|
+----------------+
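One way I've been experimenting with (a sketch, assuming every column can be typed as `StringType` and that values never contain embedded commas): split the header line into column names, build a `StructType` from them, split each remaining line into a `Row`, and pass both to `createDataFrame`.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val spark = SparkSession.builder().master("local[*]").appName("dynamic-schema").getOrCreate()
val sc = spark.sparkContext

val csv = sc.parallelize(Array(
  "col1, col2, col3",
  "1, cat, dog",
  "2, bird, bee"))

// Derive the column names from the first row
val header = csv.first().split(",").map(_.trim)

// Build the schema programmatically: one StringType field per header column
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

// Drop the header line, then split each remaining line into a Row
val rows = csv.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}.map(line => Row.fromSeq(line.split(",").map(_.trim)))

val df = spark.createDataFrame(rows, schema)
df.show()
```

This produces a three-column DataFrame (`col1`, `col2`, `col3`) rather than the single flattened column, but I'm not sure it's the idiomatic approach for many RDDs.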