
Suppose I have two dataframes: df, and grp_df, which is df grouped by "region" with the items collected into a list (roughly df.groupby(["region"]) with a collect-list aggregation):

df

 user     item    region
 james    I1      Canada
 amy      I5      Germany
 chris    I33     U.S.

grp_df

  region     Item_lst
  Canada     [I1, I2, ..., In]
  Germany    [I3, I5, ..., In]
  U.S.       [I33, I22, I11]
  ...        ...

For each user I want to select a new item that they have not bought before within the same region, and add it to a new PySpark dataframe.

new_df

 user     item    region
 james    I2      Canada
 amy      I3      Germany
 chris    I22     U.S.

What is the most efficient way to do this in PySpark?

My Approach:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = df.join(grp_df, ["region"], "left")

def get_neg_sample(item, item_lst):
    # list.remove() mutates in place and returns None, so build a filtered copy instead
    candidates = [i for i in item_lst if i != item]
    return str(np.random.choice(candidates))

get_neg_sample_udf = udf(get_neg_sample, StringType())  # items like "I2" are strings

df = df.withColumn("neg_item", get_neg_sample_udf("item", "Item_lst"))

2 Answers


The function you are looking for is array_contains; you can use it in a join condition to get your desired result:

val newDf = df.join(grp_df, df.col("region") === grp_df.col("region") && !array_contains(grp_df.col("Item_lst"), df.col("item")))
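
The snippet above is Scala; since the question asks for PySpark, a rough Python equivalent of the same idea (a sketch, assuming the question's column names region, item and Item_lst) is an equi-join on region followed by a filter on the non-containment condition:

from pyspark.sql import functions as F

# Equi-join on region, then drop rows whose item already appears
# in that region's Item_lst.
new_df = df.join(grp_df, "region").where(F.expr("NOT array_contains(Item_lst, item)"))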

For Spark 2.4+, you can use shuffle and array_remove:

from pyspark.sql import functions as F

new_df = df.join(grp_df, 'region').select('region', 'user', F.expr('shuffle(array_remove(Item_lst, item))[0]').alias('item'))
new_df.show(truncate=False)
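
Because shuffle and array_remove are native Spark SQL functions, this avoids the Python UDF (and its serialization overhead) from the question's approach, so it should generally be faster.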