
I have two dataframes, df1 and df2, and I'd like to add a new column to the second one. This new column should be the average of a column from the first dataframe. Something like this:

df1                  df2                   df2
userid count value   userid count          userid count value
11     2     5       10     1              10     1     5
22     3     4       20     1     ======>  20     1     5
33     5     6       30     1              30     1     5

I'm trying

df2 = df2.withColumn("value", avg(df1.col("value")));

which is not working. How can I do this? Thank you!

lte__
  • you need to join both dataframes before you can do any operation. Spark doesn't know how to relate df1 to df2. – Carlos Vilchez Jul 15 '16 at 14:55
  • Oh. That seems tedious, since in the end I'd want to `.unionAll()` them into a single df, but I can't do that until they have the same no of columns... – lte__ Jul 15 '16 at 14:57
  • I think the problem you're trying to solve is related to http://stackoverflow.com/a/29950853/702002 – Carlos Vilchez Jul 15 '16 at 16:36

1 Answer


It's similar to "Append a column to Data Frame in Apache Spark 1.3".

withColumn() expects a column that belongs to the DataFrame it is called on, so you can work around this with a transformation:

  • calculate the average value from df1
  • when adding the new column to df2, start from a value of 0 and then add the average

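    import org.apache.spark.sql.functions._
    // Compute the average of df1("value") and bring it to the driver as a Double
    val avgValue = df1.select(avg(df1("value"))).collect()(0).getDouble(0)
    // rand() * 0 + avgValue is a constant column; df2 must be declared as a var
    df2 = df2.withColumn("value", rand() * 0 + avgValue)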
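
As a variation on the same two steps, the constant column can also be built with `lit()` from `org.apache.spark.sql.functions` instead of `rand() * 0`. The sketch below is only an illustration: the object name `AvgColumn` and the helper `withAverageOf` are hypothetical, and it assumes the source column is numeric and the source DataFrame is not empty.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, lit}

// Hypothetical helper, not part of Spark: names are chosen for illustration only.
object AvgColumn {
  // Adds `column` to `target`, filled with the average of `source(column)`.
  def withAverageOf(source: DataFrame, target: DataFrame, column: String): DataFrame = {
    // Aggregate on the source alone and fetch the single result row to the driver.
    // Assumption: `source` is non-empty, otherwise the average is null and getDouble fails.
    val avgValue = source.agg(avg(column)).first().getDouble(0)
    // lit() wraps the driver-side value in a constant Column that any DataFrame accepts.
    target.withColumn(column, lit(avgValue))
  }
}
```

With the DataFrames from the question, this would be called as `val result = AvgColumn.withAverageOf(df1, df2, "value")`. The new column is a Double, so a cast may be needed if its type has to match `df1("value")` before the two DataFrames are combined.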
    
yanghaogn
  • Ah! `rand() * 0 + avgValue` is a really clever way to generate a column of data from a single value. Thanks! I'll test it on Monday, but I'll trust you on this and accept your answer ;) – lte__ Jul 16 '16 at 15:06