I have been struggling with this for a few days now. I read a lot online, found some similar questions such as: Pandas counting occurrence of list contained in column of lists or pandas: count string criteria across down rows but neither fully work in this case.
I have two dataframes: df1 consists of a column of strings. df2 consists of a column of lists (the lists are a combination of the strings from df1, each element within one list is unique).
I would like to know in how many lists of df2 occur each combination of strings. So, how many lists have "a" and "b" as elements? How many lists have "a" and "c" as elements and so forth.
This is how df1 looks like (simplified):
df1 = pd.DataFrame({"subject": ["a", "b", "c"]})
df1
subject
0 a
1 b
3 c
This is how df2 looks like (simplified).
df2 = pd.DataFrame({"subject_list": [["a", "b" ,"c"], ["b", "c"], ["a", "b"], ["b", "c"], ["c"]]})
df2
subject_list
0 ["a", "b" ,"c"]
1 ["a", "b"]
2 ["b", "c"]
3 ["c"]
4 ["b", "c"]
I have two codes which both work but aren't quite right:
This code looks for the combination of two rows in df1 (as wanted). However, df1 includes more rows than df2 so it stops with the last row of df2. But there are still some "string-combinations" to test.
df1["combination_0"] = df2["subject_list"].apply(lambda x: x.count(x and df.subject[0]))
This code counts the occurrence of one "list". However, I can't figure out how to change it so that it does it for each value combination.
df1["list_a_b"] = df2["subject_list"].apply(lambda x: x.count(df1.subject[0] and df1.subject[1]))
df1.list_a_b.sum()