
I want to load a struct from a database collection, and attach it as a constant column to every row in a target DataFrame.

I can load the column I need as a DataFrame with one row, then do a crossJoin to paste it onto each row of the target:

import org.apache.spark.sql.functions.broadcast

val parentCollectionDF = ??? /* load a single row from the database */
// Broadcast the one-row frame so the cross join stays cheap
val constantCol = broadcast(parentCollectionDF.select("my_column"))
val result = childCollectionDF.crossJoin(constantCol)

It works, but it feels wasteful: the struct is the same for every row of the child collection, yet the crossJoin copies it onto each row.

If I could hardcode the values, I could use something like childCollectionDF.withColumn("my_column", struct(lit(val1) as "field1", lit(val2) as "field2" /* etc. */)). But I don't know the values ahead of time; I need to load the struct from the parent collection.
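
For illustration, here is what that hardcoded version would look like in full (the field names and values are made up):

import org.apache.spark.sql.functions.{lit, struct}

// Hypothetical hardcoded values, just to show the shape of the call
val hardcoded = childCollectionDF.withColumn(
  "my_column",
  struct(lit(42) as "field1", lit("foo") as "field2"))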

What I'm looking for is something like:

childCollectionDF.withColumn("my_column",
  lit(parentCollectionDF.select("my_column").head().getStruct(0)))

... but I can see from the code for literals that only basic types can be used as an argument to lit(). No good to pass a GenericRowWithSchema or a case class here.
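
The closest workaround I can come up with is to collect the single row and rebuild the struct field by field. A sketch, assuming every field of the struct is a simple type that lit() accepts:

import org.apache.spark.sql.functions.{lit, struct}

// Pull the one struct value down to the driver
val parentRow = parentCollectionDF.select("my_column").head().getStruct(0)

// Rebuild the struct as a literal column, one field at a time
val fields = parentRow.schema.fields.map { f =>
  lit(parentRow.getAs[Any](f.name)) as f.name
}
val result = childCollectionDF.withColumn("my_column", struct(fields: _*))

But that falls apart as soon as a field is itself a struct, so it still feels clumsy.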

Is there a less clumsy way to do this? (Spark 2.1.1, Scala)

[edit: Not the same as this question, which explains how to add a struct with literal (hardcoded) constants. My struct needs to be loaded dynamically.]

    Possible duplicate of [How to add a constant column in a Spark DataFrame?](https://stackoverflow.com/questions/32788322/how-to-add-a-constant-column-in-a-spark-dataframe) – zero323 Jun 12 '17 at 13:23

0 Answers