PySpark add new column to dataframe with new list

Question

Based on previous questions: 1, 2. Suppose I have the following dataframe:

df = spark.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], 
     ("x1", "x2", "x3"))

And I want to add new column x4 but I have value in a list of Python instead to add to the new column e.g. x4_ls = [35.0, 32.0]. Is there a best way to add new column to the Spark dataframe? (note that I use Spark 2.1)

Output should be something like:

## +---+---+-----+----+
## | x1| x2|   x3|  x4|
## +---+---+-----+----+
## |  1|  a| 23.0|35.0|
## |  3|  B|-23.0|32.0|
## +---+---+-----+----+

I can also transform my list to dataframe df_x4 = spark.createDataFrame([Row(**{'x4': x}) for x in x4_ls]) (but I don't how to concatenate dataframe together)

Gaurav Dhama · Answer 1 · 2017-02-15T21:13:54.707

We can concatenate on the basis of rownumbers as follows. Suppose we have two dataframes df and df_x4 :

def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features','rownum'])
    odf = dff.map(lambda x : tuple(x.features)+tuple([x.rownum])).toDF(df.columns+['rownum'])
    return odf

df1 = addrownum(df)
df2 = addrownum(df_x4)

outputdf = df1.join(df2,df1.rownum==df2.rownum).drop(df1.rownum).drop(df2.rownum)

## outputdf
## +---+---+-----+----+
## | x1| x2|   x3|  x4|
## +---+---+-----+----+
## |  1|  a| 23.0|35.0|
## |  3|  B|-23.0|32.0|
## +---+---+-----+----+

outputdf is your required output dataframe

score 0 · Accepted Answer · answered Feb 15 '17 at 20:02

Thanks to Gaurav Dhama for a great answer! I made changes a little with his solution. Here is my solution which join two dataframe together on added new column row_num.

from pyspark.sql import Row

def flatten_row(r):
    r_ =  r.features.asDict()
    r_.update({'row_num': r.row_num})
    return Row(**r_)

def add_row_num(df):
    df_row_num = df.rdd.zipWithIndex().toDF(['features', 'row_num'])
    df_out = df_row_num.rdd.map(lambda x : flatten_row(x)).toDF()
    return df_out

df = add_row_num(df)
df_x4 = add_row_num(df_x4)
df_concat = df.join(df_x4, on='row_num').drop('row_num')

Seems like there is no easy way to concatenate dataframe or add values to dataframe right now. — titipata, Feb 15 '17 at 20:13

PySpark add new column to dataframe with new list

2 Answers2