I have a pyspark dataframe:
Example:
text <String> | name <String> | original_name <String>
----------------------------------------------------------------------------
HELLOWORLD2019THISISGOOGLE | WORLD2019 | WORLD_2019
----------------------------------------------------------------------------
NATUREISVERYGOODFORHEALTH | null | null
----------------------------------------------------------------------------
THESUNCONTAINVITAMIND | VITAMIND | VITAMIN_D
----------------------------------------------------------------------------
BECARETOOURHEALTHISVITAMIND | OURHEALTH | OUR_/HEALTH
----------------------------------------------------------------------------
I want to loop the name
column and look if name
values exists in text
, if yes, I create a new_column
, will be contain the original_name
value of the name
values found in text
. Knowing that some times the dataframe columns are null
.
Example:
in the line 4 in the dataframe example, the
text
contain 2 values fromname
column:[OURHEALTH, VITAMIND]
, I should take itsoriginal_name
values and store them in anew_column
.in the line 2, the
text
containOURHEALTH
fromname
column, I should store in thenew_column
the originalname
value that found ==>[OUR_/HEALTH]
Expect result:
text <String> | name <String> | original_name <String> | new_column <Array>
------------------------------|------------------|---------------------------|----------------------------
HELLOWORLD2019THISISGOOGLE | WORLD2019 | WORLD_2019 | [WORLD_2019]
------------------------------|------------------|---------------------------|----------------------------
NATUREISVERYGOODFOROURHEALTH | null | null | [OUR_/HEALTH]
------------------------------|------------------|---------------------------|----------------------------
THESUNCONTAINVITAMIND | VITAMIND | VITAMIN_D | [VITAMIN_D]
------------------------------|------------------|---------------------------|----------------------------
BECARETOOURHEALTHISVITAMIND | OURHEALTH | OUR_/HEALTH | [OUR_/HEALTH, VITAMIN_D ]
-----------------------------------------------------------------------------|----------------------------
I hope that I was clear in my explanation.
I tried by the following code:
df = df.select("text", "name", "original_name").agg(collect_set("name").alias("name_array"))
for name_item in name_array:
df.withColumn("new_column", F.when(df.text.contains(name_item), "original_name").otherwise(None))
Someone can help me please ? Thank you