
I have a dataframe like so:

+-------+----------------+
|Name   |Source          |
+-------+----------------+
|Tom    |clientA-incoming|
|Dick   |clientB-incoming|
|Harry  |c-abc-incoming  |
+-------+----------------+

and I would like to add a column `slug` to end up with this dataframe:

+-------+----------------+--------+
|Name   |Source          |slug    |
+-------+----------------+--------+
|Tom    |clientA-incoming|clientA |
|Dick   |clientB-incoming|clientB |
|Harry  |c-abc-incoming  |c-abc   |
+-------+----------------+--------+

I also have a list that contains the slugs:

slugs = ['clientA', 'clientB', 'c-abc']

Basically I'm thinking of something along the lines of this pseudocode:

for i in slugs:
    if i in df['Source']:
        df['Slug'] = i

Can someone help me cross the finish line?

EDIT:

I want to update the `slug` column with a value from the `slugs` list. The specific value that goes into the `slug` column is determined based on the `Source` column.

For example, since `slugs[0] = 'clientA'` and `clientA` is a substring of `clientA-incoming`, I would like to set that row's value in the `slug` column to `clientA`.

DBA108642
  • So presumably you'll have some rows that aren't in `slugs` and they're different lengths? – roganjosh Feb 20 '20 at 20:41
  • Do you want to populate the `slug` column based on the `Source` column, or from the `slugs` list? – notNull Feb 20 '20 at 20:43
  • Every row has a corresponding value in `slugs`, but there will be values in `slugs` that don't have a row – DBA108642 Feb 20 '20 at 20:43
  • @Shu I want to populate `slug` using data from the list `slugs` based on values from `Source` – DBA108642 Feb 20 '20 at 20:44
  • Use a list comp with `coalesce`, `contains`, (or `startswith`) and `when` to do [if-then-else](https://stackoverflow.com/questions/39048229/spark-equivalent-of-if-then-else) logic: Something like: `df.withColumn('slug', coalesce(*[when(col('Source').contains(slug), lit(slug)) for slug in slugs]))` – pault Feb 20 '20 at 22:14
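For reference, pault's one-liner expanded into a runnable form — a minimal sketch, assuming the question's `df` and `slugs` are already defined in an active Spark session:

    from pyspark.sql.functions import coalesce, col, lit, when

    # Build one when(...) expression per slug: each yields the slug where
    # Source contains it, and null otherwise. coalesce() keeps the first
    # non-null result, so rows matching no slug end up with a null slug.
    df = df.withColumn(
        'slug',
        coalesce(*[when(col('Source').contains(s), lit(s)) for s in slugs])
    )
    df.show()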

1 Answer


This can be solved with a left or an inner join, depending on whether rows without a match should be kept:

from pyspark.sql.functions import broadcast

# Turn the slug list into a single-column dataframe named "slug"
slugs = ['clientA', 'clientB', 'c-abc', 'f-gd']
sdf = spark.createDataFrame(slugs, "string").withColumnRenamed("value", "slug")

df = spark.createDataFrame([
  ["Tom", "clientA-incoming"],
  ["Dick", "clientB-incoming"],
  ["Harry", "c-abc-incoming"],
  ["Harry", "c-dgl-incoming"]
], ["Name", "Source"])

# Join on a substring match; the left join keeps rows whose Source
# matches no slug, leaving their slug null
df.join(broadcast(sdf), df["Source"].contains(sdf["slug"]), "left").show()

# +-----+----------------+-------+
# | Name|          Source|   slug|
# +-----+----------------+-------+
# |  Tom|clientA-incoming|clientA|
# | Dick|clientB-incoming|clientB|
# |Harry|  c-abc-incoming|  c-abc|
# |Harry|  c-dgl-incoming|   null|
# +-----+----------------+-------+

Note that we broadcast the smaller dataframe (`sdf`) to avoid a shuffle during the join.
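
If rows with no matching slug should be dropped instead of kept with a null, swap the join type to inner (same dataframes as above):

    # The inner join drops the c-dgl-incoming row, since no slug matches it
    df.join(broadcast(sdf), df["Source"].contains(sdf["slug"]), "inner").show()

    # +-----+----------------+-------+
    # | Name|          Source|   slug|
    # +-----+----------------+-------+
    # |  Tom|clientA-incoming|clientA|
    # | Dick|clientB-incoming|clientB|
    # |Harry|  c-abc-incoming|  c-abc|
    # +-----+----------------+-------+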

abiratsis