
I'm trying to convert an RDD[String] to a DataFrame. Each string is comma-separated, so I would like to get one column for each value between the commas. To do so, I've tried these steps:

val allNewData_split = allNewData.map(e => e.split(",")) //RDD[Array[String]]
val df_newData = allNewData_split.toDF()  //DataFrame

But I'm getting this:

+--------------------+
|               value|
+--------------------+
|[0.0, 0.170716979...|
|[0.0, 0.272535901...|
|[0.0, 0.232002948...|
+--------------------+

It is not a duplicate of this post (How to convert rdd object to dataframe in spark) because I'm asking about RDD[String] instead of RDD[Row].

And it also isn't a duplicate of Spark - load CSV file as DataFrame? because this question isn't about reading a CSV file as a DataFrame.

diens

1 Answer


If all your arrays have the same size, you can turn the array into columns like this, using apply on Column:

val df = Seq(
  Array(1,2,3),
  Array(4,5,6)
).toDF("arr")

df.show()

+---------+
|      arr|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
+---------+

val ncols = 3

val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))

df
  .select(selectCols:_*)
  .show()

+-----+-----+-----+
|col_0|col_1|col_2|
+-----+-----+-----+
|    1|    2|    3|
|    4|    5|    6|
+-----+-----+-----+
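
Applied to the original RDD[String], the same idea looks like this. This is a sketch, not tested against your data: it assumes `spark` is an active SparkSession (so `spark.implicits._` can be imported for `toDF` and `$`), that `allNewData` is the RDD[String] from the question, and that every line splits into the same number of fields.

```scala
import spark.implicits._  // assumes `spark` is an active SparkSession

// Split each comma-separated line into an array (as in the question)
val allNewData_split = allNewData.map(_.split(","))   // RDD[Array[String]]

// toDF on an RDD of arrays yields a single array column; name it "arr"
val df_newData = allNewData_split.toDF("arr")

// Derive the column count from the first row -- assumes uniform row length
val ncols = allNewData_split.first().length

// One select expression per array index, named col_0, col_1, ...
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))

df_newData.select(selectCols: _*).show()
```

Note that the resulting columns are still strings; cast them (e.g. `$"arr"(i).cast("double")`) if you need numeric types.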
Raphael Roth