
I have a dataframe with a column ('target_column' in this case) whose values are names of other columns, and I need to update those named columns with the 'val' column values.

I have tried using UDFs and .withColumn, but they all expect a fixed column name; in my case the target column can vary per row. Using RDD map transformations also didn't work, since RDDs are immutable.

from pyspark.sql import SparkSession


def test():
    data = [("jose_1", "mase", "firstname", "jane"), ("li_1", "ken", "lastname", "keno"), ("liz_1", "durn", "firstname", "liz")]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()


if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()

Input:

+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|   jose_1|    mase|    firstname|jane|
|     li_1|     ken|     lastname|keno|
|    liz_1|    durn|    firstname| liz|
+---------+--------+-------------+----+

Expected output:

+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
|     jane|    mase|    firstname|jane|
|     li_1|    keno|     lastname|keno|
|      liz|    durn|    firstname| liz|
+---------+--------+-------------+----+

For e.g. in first row in input the target_column is 'firstname' and val is 'jane'. So I need to update the firstname with 'jane' in that row.

Thanks

Carrot
  • target column can take any value ? – Steven Dec 30 '19 at 10:05
  • Does this answer your question? [PySpark- How to use a row value from one column to access another column which has the same name as of the row value](https://stackoverflow.com/questions/48432894/pyspark-how-to-use-a-row-value-from-one-column-to-access-another-column-which-h) – user10938362 Dec 30 '19 at 10:13
  • The target_column has column names as values and these column names should be updated with the corresponding val column value. – Carrot Dec 30 '19 at 10:14
  • For e.g. in first row in input the target_column is 'firstname' and val is 'jane'. So I need to update the firstname with 'jane' in that row. – Carrot Dec 30 '19 at 10:17

1 Answer


You can do a loop over all your columns:

from pyspark.sql import functions as F

# For each column, replace its value with `val` on rows
# where `target_column` names that column; keep it otherwise.
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
Steven