
I want to update a value where userid=22650984. How can I do this in PySpark? Thank you for helping.

>>> xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+--------+----------------------------+
|userid  |registration_time           |
+--------+----------------------------+
|22650984|270972-04-26 13:14:46.345152|
+--------+----------------------------+

3 Answers


If you want to modify a subset of your DataFrame and keep the rest unchanged, the best option is pyspark.sql.functions.when(), since using filter() (or its alias where()) would remove all rows where the condition is not met.

from pyspark.sql.functions import col, when

valueWhenTrue = None  # for example

# Overwrite the column only where the condition holds;
# otherwise() keeps the existing value for every other row
df = df.withColumn(
    "existingColumnToUpdate",
    when(
        col("userid") == 22650984,
        valueWhenTrue
    ).otherwise(col("existingColumnToUpdate"))
)

when() evaluates its first argument as a boolean condition. If the condition is True, it returns the second argument. You can chain together multiple when() calls, and use otherwise() to specify what to return when the condition is False.
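For example, a sketch of chaining (the "flag" column and its labels are made up for illustration):

from pyspark.sql.functions import col, when

# Each when() is checked in order; otherwise() is the fallback
df = df.withColumn(
    "flag",
    when(col("userid") == 22650984, "bad_timestamp")
    .when(col("registration_time").isNull(), "missing")
    .otherwise("ok")
)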

In the first snippet above, I am updating the existing column "existingColumnToUpdate". When userid equals the specified value, the column is updated with valueWhenTrue; otherwise, the value in the column is kept unchanged.
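Applied to the question's data, a minimal runnable sketch (the second row and the replacement timestamp are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's schema
xxDF = spark.createDataFrame(
    [(22650984, "270972-04-26 13:14:46.345152"),
     (10000001, "2018-01-15 08:30:00.000000")],
    ["userid", "registration_time"],
)

# Overwrite registration_time only for the matching userid
fixedDF = xxDF.withColumn(
    "registration_time",
    when(col("userid") == 22650984, lit("2018-04-08 10:57:00.000000"))
    .otherwise(col("registration_time")),
)

fixedDF.show(truncate=False)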

pault
    Exactly what I was looking for! This approach is very similar to Pandas style. Different, but understandable to those who Panda... – Michael Colon Dec 03 '18 at 00:00

Change the Value of a DataFrame Column Based on a Filter:

from pyspark.sql.functions import lit

new_df = xxDf.filter(xxDf.userid == "22650984").withColumn('column_to_update', lit(<update_expression>))

Aashish Ranjan

You can use withColumn to achieve what you are looking to do:

new_df = xxDf.filter(xxDf.userid = "22650984").withColumn(xxDf.field_to_update, <update_expression>)

The update_expression would contain your update logic - it could be a UDF, a derived field, etc.
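For illustration, a hedged sketch of a UDF as the update expression (the repair logic is invented; note the comments below, which point out that filter() keeps only the matching row):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical UDF: derive the corrected value from the old one
fix_year = udf(lambda ts: "2018" + ts[6:] if ts else ts, StringType())

# filter() drops all other rows - see the comments below
new_df = xxDf.filter(xxDf.userid == "22650984") \
             .withColumn("registration_time", fix_year(col("registration_time")))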

karthikr
    This will not work for two reasons: 1) you need to use == instead of = because you're comparing values, not assigning, 2) when using == it will filter out the rest of the df, while the user only wants to change one row – SchwarzeHuhn Dec 02 '19 at 12:34
  • working code: from pyspark.sql.functions import lit new_df = xxDf.filter(xxDf.userid == "22650984").withColumn('column_to update', lit() – Aashish Ranjan Nov 06 '20 at 10:13