
I want to duplicate a Row in a DataFrame, how can I do that?

For example, I have a DataFrame consisting of 1 Row, and I want to make a DataFrame with 100 identical Rows. I came up with the following solution:

  var data: DataFrame = singleRowDF

  for (i <- 1 to 100 - 1) {
    data = data.unionAll(singleRowDF)
  }

But this introduces many transformations and it seems my subsequent actions become very slow. Is there another way to do it?
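For intuition (plain Scala collections, not Spark), the loop amounts to appending the same one-element sequence 99 times, so the lineage grows linearly with the number of copies:

```scala
// Plain-collections model of the union loop: start from one row and
// append the same single row 99 more times.
val singleRow = Seq(("a", 1))   // stand-in for the one-row DataFrame
var data = singleRow
for (_ <- 1 to 99) data = data ++ singleRow
// data now holds 100 identical elements
```

In Spark each of those 99 `unionAll` calls adds a node to the query plan, which is why subsequent actions slow down.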

Raphael Roth
  • I don't see why this question should be closed as duplicate because this question is older than the other question... if at all, the other question should be marked as duplicate – Raphael Roth Jun 01 '20 at 19:07

3 Answers


You can add a column containing a literal array of size 100, then use explode to make each of its elements create its own row; then just get rid of this "dummy" column:

import org.apache.spark.sql.functions._

val result = singleRowDF
  .withColumn("dummy", explode(array((0 until 100).map(lit): _*)))
  .selectExpr(singleRowDF.columns: _*)
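To see why this works, here is a plain-Scala model (no Spark needed): explode emits one output row per element of the attached array, so a size-100 array on a single row yields 100 rows:

```scala
// Model of explode: one input row becomes one output row per array element.
val singleRow = Seq(("a", 1))    // stand-in for the one-row DataFrame
val dummy = 0 until 100          // the literal array of size 100
val result = singleRow.flatMap(r => dummy.map(_ => r))  // "explode", dummy discarded
```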
Tzach Zohar
  • you could `drop('dummy)` instead of the more complicated `selectExpr` – Wilmerton Nov 03 '16 at 10:25
  • @TzachZohar That's great, although I still have trouble understanding how it works :) – Raphael Roth Nov 03 '16 at 10:49
  • How to rewrite this in pyspark? I tried `df.withColumn("dummy", explode(map(lit, range(repeated)))).drop("dummy")`, and it prints out `literals, use 'lit', 'array', 'struct' or 'create ...` error – calvin Jul 02 '20 at 06:18
  • in PySpark: `df.withColumn("dummy", F.explode(F.array([F.lit(i) for i in range(100)])))` – Robin Zimmerman Dec 02 '22 at 18:01

You could pick out the single row, make a list with a hundred elements populated with that row, and convert it back into a DataFrame.

import org.apache.spark.sql.DataFrame

val testDf = sc.parallelize(Seq(
    (1,2,3), (4,5,6)
)).toDF("one", "two", "three")

def replicateDf(n: Int, df: DataFrame) = {
    // Collect the single row once: List.fill's element argument is by-name,
    // so inlining df.take(1)(0) would trigger the collect n times.
    val row = df.take(1)(0)
    sqlContext.createDataFrame(sc.parallelize(List.fill(n)(row)), df.schema)
}

val replicatedDf = replicateDf(100, testDf)
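The driver-side expansion here is just List.fill, which is easy to check with plain collections (note its element argument is by-name and evaluated once per slot, hence collecting the row into a val first):

```scala
// List.fill(n)(row) builds a list of n references to the same row.
val row = ("one", 2, 3)               // stand-in for df.take(1)(0)
val replicated = List.fill(100)(row)  // 100 identical entries
```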
zenofsahil

You could use a flatMap, or a for-comprehension, as described here.

I encourage you to use Datasets whenever you can, but if that's not possible, the last example in the link works with DataFrames as well:

val df = Seq(
  (0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3"))
).toDF("id", "text", "value", "properties")

val df2 = for {
  row <- df
  p <- row.getAs[Seq[String]]("properties")
} yield (row.getAs[Int]("id"), row.getAs[String]("text"), row.getAs[Double]("value"), p)
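The for-comprehension desugars into flatMap over the rows and map over each properties list; the same shape works with plain Scala collections (a self-contained model, with a hypothetical Rec case class standing in for Row):

```scala
// Same for-comprehension shape over a plain Seq instead of a Dataset.
case class Rec(id: Int, text: String, value: Double, properties: List[String])

val df = Seq(Rec(0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3")))

val df2 = for {
  row <- df
  p   <- row.properties
} yield (row.id, row.text, row.value, p)
// one output tuple per property value
```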

Also keep in mind that the Dataset.explode method is deprecated (the explode function in org.apache.spark.sql.functions is not), see here.

ruloweb