
I want to update a value where userid=22650984. How can I do this in PySpark? Thank you for helping.

>>> xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+--------+----------------------------+
|userid  |registration_time           |
+--------+----------------------------+
|22650984|270972-04-26 13:14:46.345152|
+--------+----------------------------+

3 Answers


If you want to modify a subset of your DataFrame and keep the rest unchanged, the best option is pyspark.sql.functions.when(), since using filter() (or its alias where()) would remove all rows where the condition is not met.

from pyspark.sql.functions import col, when

valueWhenTrue = None  # for example

# Overwrite the column only where the condition holds;
# otherwise() keeps the existing value for every other row
df = df.withColumn(
    "existingColumnToUpdate",
    when(
        col("userid") == 22650984,
        valueWhenTrue
    ).otherwise(col("existingColumnToUpdate"))
)

when() evaluates its first argument as a boolean condition. If the condition is True, it returns the second argument. You can chain together multiple when() calls, and use otherwise() to specify what to return when the condition is False.
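For example, a sketch of chaining (the "flag" column and its labels are made up for illustration):

from pyspark.sql.functions import col, when

# Each when() is checked in order; otherwise() is the fallback
df = df.withColumn(
    "flag",
    when(col("userid") == 22650984, "bad_timestamp")
    .when(col("registration_time").isNull(), "missing")
    .otherwise("ok")
)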

In the first snippet above, I am updating the existing column "existingColumnToUpdate". When userid equals the specified value, the column is updated with valueWhenTrue; otherwise, the value in the column is kept unchanged.
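Applied to the question's data, a minimal runnable sketch (the second row and the replacement timestamp are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's schema
xxDF = spark.createDataFrame(
    [(22650984, "270972-04-26 13:14:46.345152"),
     (10000001, "2018-01-15 08:30:00.000000")],
    ["userid", "registration_time"],
)

# Overwrite registration_time only for the matching userid
fixedDF = xxDF.withColumn(
    "registration_time",
    when(col("userid") == 22650984, lit("2018-04-08 10:57:00.000000"))
    .otherwise(col("registration_time")),
)

fixedDF.show(truncate=False)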

pault
    Exactly what I was looking for! This approach is very similar to Pandas style. Different, but understandable to those who Panda... – Michael Colon Dec 03 '18 at 00:00

Change the Value of a DataFrame Column Based on a Filter:

from pyspark.sql.functions import lit

new_df = xxDf.filter(xxDf.userid == "22650984").withColumn('column_to_update', lit(<update_expression>))

Aashish Ranjan

You can use withColumn to achieve what you are looking to do:

new_df = xxDf.filter(xxDf.userid = "22650984").withColumn(xxDf.field_to_update, <update_expression>)

The update_expression would contain your update logic - it could be a UDF, a derived field, etc.
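For illustration, a hedged sketch of a UDF as the update expression (the repair logic is invented; note the comments below, which point out that filter() keeps only the matching row):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical UDF: derive the corrected value from the old one
fix_year = udf(lambda ts: "2018" + ts[6:] if ts else ts, StringType())

# filter() drops all other rows - see the comments below
new_df = xxDf.filter(xxDf.userid == "22650984") \
             .withColumn("registration_time", fix_year(col("registration_time")))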

karthikr
    This will not work for two reasons: 1) you need to use == instead of = because you're comparing values, not assigning, 2) when using == it will filter out the rest of the df, while the user only wants to change one row – SchwarzeHuhn Dec 02 '19 at 12:34
  • working code: from pyspark.sql.functions import lit new_df = xxDf.filter(xxDf.userid == "22650984").withColumn('column_to update', lit() – Aashish Ranjan Nov 06 '20 at 10:13