
I have 3 DataFrames in Pandas:

UserItem is a DataFrame of users and the items they chose, with 2 columns, `user` and `item`.

UserTag is a DataFrame of users and tags, with 2 columns, `user` and `tag`.

ItemTag is a DataFrame of items and tags, with 2 columns, `item` and `tag`.

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B']      ,  'item': ['i', 'j', 'k']})
UserTag_df  = pd.DataFrame({'user': ['A', 'B']           ,  'tag' : ['T', 'R']})
ItemTag_df  = pd.DataFrame({'item': ['i', 'j', 'k', 'k'] ,  'tag' : ['T', 'S', 'T', 'R']})

I want to compute, for each (user, item) pair in UserItem, the size of the intersection (and union as well!) of the tags of that user with the tags of that item.

Answer_df = pd.DataFrame({'user': ['A', 'B', 'B']  , 'item': ['i', 'j', 'k'], 'intersection':  [1, 0, 1], 'union' : [1, 2, 2]})

What's the most efficient way to do this? UserItem_df has about 30M rows, and the other two about 500k rows each. The product set of all possible (user, item) pairs is about 30 billion; I don't need the intersection and union for all possible pairs, just for the pairs that actually appear in UserItem_df.
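For concreteness, here is a naive per-row sketch that produces the desired Answer_df (far too slow for 30M rows, but useful as a correctness check for any faster approach; the helper names are just illustrative):

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df  = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df  = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

# Build a tag set per user and per item.
user_tags = UserTag_df.groupby('user')['tag'].agg(set).to_dict()
item_tags = ItemTag_df.groupby('item')['tag'].agg(set).to_dict()

def row_sizes(row):
    # Plain Python set operations, one row at a time.
    u = user_tags.get(row['user'], set())
    i = item_tags.get(row['item'], set())
    return pd.Series({'intersection': len(u & i), 'union': len(u | i)})

Answer_df = UserItem_df.join(UserItem_df.apply(row_sizes, axis=1))
```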

Roko Mijic

1 Answer


Use:

# step 1: pair each (user, item) row with every tag of that user
df1 = pd.merge(UserItem_df, UserTag_df, on='user')

# step 2: pair each (user, item) row with every tag of that item
df2 = pd.merge(UserItem_df, ItemTag_df, on='item')

# step 3: stack both tag lists per (user, item) pair
df3 = pd.concat([df1, df2], ignore_index=True)

# step 4: per pair, count = |user tags| + |item tags| and
# nunique = |union|. A shared tag is counted twice in `count`
# but once in `nunique`, so count - nunique = |intersection|
# (assuming no duplicate (user, tag) or (item, tag) rows).
df3 = (
    df3.groupby(['user', 'item'])['tag']
    .agg(intersection='count', union='nunique')
    .reset_index()
)
df3['intersection'] -= df3['union']

Steps:

# step 1: df1
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R

# step 2: df2
  user item tag
0    A    i   T
1    B    j   S
2    B    k   T
3    B    k   R

# step 3: df3
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R
3    A    i   T
4    B    j   S
5    B    k   T
6    B    k   R

# step 4: df3
  user item  intersection  union
0    A    i             1      1
1    B    j             0      2
2    B    k             1      2
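As a cross-check (a sketch, not part of the answer above): a tag is in the intersection exactly when it appears in both df1 and df2 for the same pair, so an inner merge on all three columns recovers the intersection directly. Note that pairs with an empty intersection (like B, j) drop out and would need to be reinstated with a left join or `fillna(0)`:

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df  = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df  = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

df1 = pd.merge(UserItem_df, UserTag_df, on='user')
df2 = pd.merge(UserItem_df, ItemTag_df, on='item')

# Inner merge on (user, item, tag) keeps only the shared tags,
# so the group sizes are the intersection cardinalities.
inter = (
    pd.merge(df1, df2, on=['user', 'item', 'tag'])
    .groupby(['user', 'item'])
    .size()
    .rename('intersection')
    .reset_index()
)
```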
Shubham Sharma
  • Thanks, this seems to work for intersection, but for union the time scaling seems to be bad. – Roko Mijic Jun 16 '20 at 14:58
  • What do you mean by `time scaling`? Can you explain more? – Shubham Sharma Jun 16 '20 at 14:59
  • using just `.agg(intersection='sum').reset_index()`, the query is almost instant, but using `union=lambda s: s.eq(0).sum() * 2` it is still running... – Roko Mijic Jun 16 '20 at 15:03
  • ok, `union=lambda s: s.eq(0).sum() * 2` finished, but took a few minutes to run on about 1% of my data. There is probably a better way to get the union, e.g. subtract the intersection from the sum – Roko Mijic Jun 16 '20 at 15:06
  • I think a groupby and count on your df1 and df2 will give the sum of tags, then subtracting the intersection will give the union. – Roko Mijic Jun 16 '20 at 15:14
  • @RokoMijic Check out the edited answer, I've made a few changes to make the answer more concise. – Shubham Sharma Jun 16 '20 at 16:26