
I have a dataframe and would like to add columns to it based on values from a list.

My list will vary from 3 to 50 values. I'm new to PySpark and I'm trying to append these values as new (empty) columns to my df.

I've seen recommended code for adding [one column][1] to a dataframe, but not multiple columns from a list.

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']

My code below only appends one column.

for new_col in mylist:
  new = datasetMatchedDomains.withColumn(new_col,f.lit(0))
new.show()




  [1]: https://stackoverflow.com/questions/48164206/pyspark-adding-a-column-from-a-list-of-values-using-a-udf

2 Answers


You can just go through the list in a loop, reassigning your df each time:

from pyspark.sql.functions import lit

for col_name in mylist:
    datasetMatchedDomains = datasetMatchedDomains.withColumn(col_name, lit(0))

Interesting follow-up - if that works, try doing it with reduce :)

P.S. Regarding your edit - `withColumn` does not modify the original DataFrame; it returns a new one every time. In your loop you were overwriting `new` on each iteration instead of accumulating the changes, so only the last column survived.
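To make the `reduce` idea concrete: it folds the list of column names into the DataFrame, applying `withColumn` once per name. The sketch below uses a plain dict and a hypothetical `with_column` helper in place of a real DataFrame, so the fold pattern is visible without needing a SparkSession; with Spark it would be `reduce(lambda acc, c: acc.withColumn(c, lit(0)), mylist, df)`.

```python
from functools import reduce

def with_column(d, name, value):
    # Like DataFrame.withColumn: return a new object, leave the input untouched
    copy = dict(d)
    copy[name] = value
    return copy

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId']
df = {'_1': 'a', '_2': 12343}

# Fold the list into df: each step adds one column set to 0
df = reduce(lambda acc, c: with_column(acc, c, 0), mylist, df)
print(sorted(df))
```

The accumulator plays the role of the DataFrame: each step consumes the previous result and produces a new one, which is exactly what the explicit loop above does with reassignment.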

    your suggested code gives me the desired output, thanks! And thanks for the `withColumn` explanation. What's your suggestion with `reduce`? – jgtrz May 12 '20 at 17:41
  • I mean, you can rewrite it in a slightly more functional style (though it's just a style preference, totally up to you and won't affect performance in any way): you can try doing something like: `functools.reduce(lambda df, col_name: df.withColumn(col_name), mylist)` (it's more pseudocode here, I can't write it correctly from the top of my head) – Rayan Ral May 12 '20 at 17:42
  • gotcha. Another question, after re-running the code a couple times, now it only prints the last list value as a column...@Rayan Ral – jgtrz May 12 '20 at 17:55
  • I have another question. Can't figure out what I'm missing: https://stackoverflow.com/questions/62070186/pyspark-mapping-values-from-different-dataframes @Rayan Ral – jgtrz May 28 '20 at 22:13

We can also use a list comprehension with `.select` to add new columns to the dataframe.

Example:

#sample dataframe
df.show()
#+---+-----+---+---+----+
#| _1|   _2| _3| _4|  _5|
#+---+-----+---+---+----+
#|   |12343|   |9  |   0|
#+---+-----+---+---+----+

mylist = ['ConformedLeaseRecoveryTypeId', 'ConformedLeaseStatusId', 'ConformedLeaseTypeId', 'ConformedLeaseRecoveryTypeName', 'ConformedLeaseStatusName', 'ConformedLeaseTypeName']

from pyspark.sql.functions import col, lit

#existing columns plus a literal 0 aliased to each new column name
cols = [col(col_name) for col_name in df.columns] + [lit(0).alias(col_name) for col_name in mylist]

#in case you want to cast the new fields:
cols = [col(col_name) for col_name in df.columns] + [lit(0).cast("string").alias(col_name) for col_name in mylist]

#adding new columns and selecting existing columns    
df.select(cols).show()
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#| _1|   _2| _3| _4|  _5|ConformedLeaseRecoveryTypeId|ConformedLeaseStatusId|ConformedLeaseTypeId|ConformedLeaseRecoveryTypeName|ConformedLeaseStatusName|ConformedLeaseTypeName|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
#|   |12343|   |9  |   0|                           0|                     0|                   0|                             0|                       0|                     0|
#+---+-----+---+---+----+----------------------------+----------------------+--------------------+------------------------------+------------------------+----------------------+
    thanks for the alternative solution & explanation! I'm going with this answer since using `withColumn` only appended one value from the list. @Shu – jgtrz May 12 '20 at 18:08
  • I have another question, thanks in advance! https://stackoverflow.com/questions/61787976/populating-column-in-dataframe-with-pyspark @Shu – jgtrz May 14 '20 at 02:01
  • I have a follow-up question, dropping the link, thanks in advance! https://stackoverflow.com/questions/61823544/pyspark-mapping-multiple-columns @Shu – jgtrz May 15 '20 at 19:54
  • I have another question. Can't figure out what I'm missing. Thanks in advance! https://stackoverflow.com/questions/62070186/pyspark-mapping-values-from-different-dataframes – jgtrz May 28 '20 at 22:14