0

I have a Pyspark dataframe df, like following:

+---+----+---+
| id|name|  c|
+---+----+---+
|  1|   a|  5|
|  2|   b|  4|
|  3|   c|  2|
|  4|   d|  3|
|  5|   e|  1|
+---+----+---+

I want to add a column match_name that have value from the name column where id == c

Is it possible to do it with function withColumn()?

Currently i have to create two dataframes and then perform join. Which is inefficient on large dataset.

Expected Output:

+---+----+---+----------+
| id|name|  c|match_name|
+---+----+---+----------+
|  1|   a|  5|         e|
|  2|   b|  4|         d|
|  3|   c|  2|         b|
|  4|   d|  3|         c|
|  5|   e|  1|         a|
+---+----+---+----------+
Jugraj Singh
  • 529
  • 1
  • 6
  • 22

1 Answers1

2

Yes, it is possible, with when:

from pyspark.sql.functions import when, col

condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name"))

result.show()

id name match match_name
1  a    3     null
2  b    2     b
3  c    5     null
4  d    4     d
5  e    1     null

You may also use otherwise to provide a different value if the condition is not met.

pansen
  • 6,433
  • 4
  • 19
  • 32
  • ... faster by some seconds (+1)... :) – desertnaut Nov 03 '17 at 12:07
  • ` from pyspark.sql import functions as func df = spark.createDataFrame([(1, 'a', 5), (2, 'b', 4), (3, 'c', 2), (4, 'd', 3), (5, 'e', 1)], ['id', 'name', 'c']) condition = func.col("id") == func.col("c") result = df.withColumn("match_name", func.when(condition, func.col("name"))) result.show()` – Jugraj Singh Nov 03 '17 at 12:08
  • Sir i want to take name of the column where id is `3` and add it across the row having id `1` – Jugraj Singh Nov 03 '17 at 12:11
  • @JugrajSingh 1) please, do not put long code snippets in the comments - they are unreadable 2) of course you do, because in these (new) data the condition is nowhere matched! I have confirmed that the solution with the data you provided **originally** in your post is indeed as shown and as expected... – desertnaut Nov 03 '17 at 12:13
  • agreed and apologies but my requirement is different. it is to fill all rows with ther possible match_name like join does. – Jugraj Singh Nov 03 '17 at 12:16
  • please keep in mind the value of the column come from a different row in the same dataset – Jugraj Singh Nov 03 '17 at 12:17
  • @JugrajSingh then describe it accurately, including the desired output!! The answer is **exactly** what you have asked for – desertnaut Nov 03 '17 at 12:17