+-----------+---+
|       Name|Age|
+-----------+---+
|Emma Larter| 34|
| Mia Junior| 59|
|Sophia Depp| 32|
|James Smith| 40|
+-----------+---+

I have a spark dataframe as above. I want to append a column to the dataframe using below list:

Salary = [35000, 24000, 55000, 40000]

How to do it in simple way using spark?

I can do this with pandas, but not spark.

notNull
2 Answers


In PySpark, use the zipWithIndex function to generate an index column on both DataFrames, then join on it.

Example:

from pyspark.sql.types import IntegerType

df = spark.createDataFrame([('Emma Larter',34),('Mia Junior',59),('Sophia',32),('James',40)],['Name','Age'])

# zipWithIndex pairs each row with a sequential index: (row, index)
df_ind = spark.createDataFrame(df.rdd.zipWithIndex(),['val','ind'])

Salary = [35000, 24000, 55000, 40000]
# Build an indexed DataFrame from the list in the same way
df_salary = spark.createDataFrame(spark.createDataFrame(Salary, IntegerType()).rdd.zipWithIndex(),['val1','ind'])

# Join on the index, sort to keep the original row order, expand the
# struct columns, and rename the list's default 'value' column to 'Salary'
df_ind.join(df_salary,['ind']).orderBy('ind').select("val.*","val1.*").withColumnRenamed('value','Salary').show()

#+-----------+---+------+
#|       Name|Age|Salary|
#+-----------+---+------+
#|Emma Larter| 34| 35000|
#| Mia Junior| 59| 24000|
#|     Sophia| 32| 55000|
#|      James| 40| 40000|
#+-----------+---+------+
notNull

You could easily convert your PySpark DataFrame to pandas using the toPandas() method, append the new column there, and convert back if needed.

from pyspark.shell import spark

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])

# Collect the data to the driver as a pandas DataFrame and append the column
new_pandas_df = df.toPandas()
new_pandas_df['gender'] = ['M', 'F', 'M']

print(new_pandas_df)

# Convert back to a Spark DataFrame if you need one:
# df = spark.createDataFrame(new_pandas_df)

Output:

      name  age gender
0    Alice   25      M
1      Bob   30      F
2  Charlie   35      M

Please note I've used a test DataFrame in my answer; change it according to yours. Also, pandas has its own downside since it does all the processing in driver memory, so keep that in mind when working with larger datasets.
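Applying the same idea to the question's own data: once you have a pandas frame, the append is plain column assignment, which relies on the list having exactly one value per row in row order. A minimal sketch (assuming the Name/Age values and Salary list from the question):

```python
import pandas as pd

# The question's data, rebuilt as a pandas DataFrame for illustration
pdf = pd.DataFrame({'Name': ['Emma Larter', 'Mia Junior', 'Sophia Depp', 'James Smith'],
                    'Age': [34, 59, 32, 40]})
Salary = [35000, 24000, 55000, 40000]

# Column assignment appends the list positionally, one value per row
pdf['Salary'] = Salary

# Hand the result back to Spark if a Spark DataFrame is needed:
# df = spark.createDataFrame(pdf)
print(pdf)
```

This only works when the list length matches the row count; pandas raises a ValueError otherwise, which is a useful sanity check before converting back to Spark.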

Kulasangar