
I have an arbitrary-length Array[String] like:

val strs = Array[String]("id","value","group","ts")

How can I convert it to a DataFrame that looks like:

+-----+------+-------+----+
| _0  | _1   | _2    | _3 |
+-----+------+-------+----+
|   id| value| group | ts |
+-----+------+-------+----+

The solutions I tried:

code:

spark.sparkContext.parallelize(List((strs.toList))).toDF().show()

or

spark.sparkContext.parallelize(List(strs)).toDF().show()

result:

+--------------------+
|               value|
+--------------------+
|[id, value, group...|
+--------------------+

code:

spark.sparkContext.parallelize(strs).toDF().show()

result:

+-----+
|value|
+-----+
|   id|
|value|
|group|
|   ts|
+-----+

Not really what I want.

I know the solution as:

val data1 = List(
  (1,"A","X",1),
  (2,"B","X",2),
  (3,"C",null,3),
  (3,"D","C",3),
  (4,"E","D",3)
).toDF("id","value","group","ts").show()

But in my case, the Array[String] is of arbitrary length.

  • Plenty of resources to be googled. – thebluephantom Feb 28 '19 at 00:02
  • Possible duplicate of [Convert List into dataframe spark scala](https://stackoverflow.com/questions/41867147/convert-list-into-dataframe-spark-scala) – Shaido Feb 28 '19 at 05:22
  • I tried the answers, but they are not what I expected @Shaido –  Feb 28 '19 at 15:08
  • The issue is that the poster starts with a variable of type Array[String], and does not want to rewrite it to embed the sequence directly as `.parallelize(List("a","b","c"))`. That would constitute hardcoding ... at least that is what I guess the intent is. Hence the referenced posting would not answer it either. – YoYo Feb 28 '19 at 15:46
  • @YoYo Yes, you are right, `List("a","b","c")` does not work here, because the Array or List is of arbitrary length; we don't know the length or the values –  Feb 28 '19 at 16:52

1 Answer

val strs = Array[String]("id","value","group","ts")
// Start from an empty List[Array[String]] and append, so the array
// stays a single element instead of being flattened.
val list_of_strs = List[Array[String]]() :+ strs
spark.sparkContext.parallelize(list_of_strs)
  .map { case Array(s1,s2,s3,s4) => (s1,s2,s3,s4) } // tuple => one column per element
  .toDF().show()

The issue is apparently in creating a list with a single element when that element is itself a collection. I guess the solution is to create an empty list first, and then add the single element.

As per the updates, the issue appears to be that we are not dealing with tuples, so this might also work:

val strs = Array[String]("id","value","group","ts")
spark.sparkContext.parallelize(List(strs))
  .map { case Array(s1,s2,s3,s4) => (s1,s2,s3,s4) }
  .toDF().show()

But I do not think we can deal with an Array of arbitrary length, as that would result in a tuple of arbitrary length ... and that does not make sense, since a DataFrame deals in rows of a fixed definition (number of columns and column types). If that really happens, you are going to have to pad the remaining tuple slots with blanks and work with the largest tuple.
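That said, since the array's length is known at runtime, one way around the tuple limitation is to skip tuples entirely and build the schema at runtime with the generic Row/StructType API. A minimal sketch, assuming every column is a string and borrowing the question's _0, _1, ... column names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val strs = Array[String]("id","value","group","ts")

// One StringType column per array element, named _0, _1, ... at runtime.
val schema = StructType(strs.indices.map(i => StructField(s"_$i", StringType)))

// A single Row holding all the elements gives one row with N columns.
val rdd = spark.sparkContext.parallelize(Seq(Row(strs: _*)))

spark.createDataFrame(rdd, schema).show()

This trades the compile-time typing of tuples for a schema computed from the data, which is the usual way to handle a column count that is only known at runtime.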

YoYo
  • Your code gets the same result as: `+--------------------+ | value| +--------------------+ |[id, value, group...| +--------------------+` –  Feb 28 '19 at 16:41
  • Made an update in an attempt to first map the array to a tuple. – YoYo Feb 28 '19 at 17:37