I have 3 DataFrames in Pandas:
UserItem_df is a DataFrame of users and the items those users chose, with 2 columns, user and item.
UserTag_df is a DataFrame of users and tags, with 2 columns, user and tag.
ItemTag_df is a DataFrame of items and tags, with 2 columns, item and tag.
import pandas as pd
UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})
I want to compute, for each (user, item) pair in UserItem_df, the size of the intersection, and also the size of the union, of that user's tags with that item's tags. For the example above, the expected result is:
Answer_df = pd.DataFrame({'user': ['A', 'B', 'B'] , 'item': ['i', 'j', 'k'], 'intersection': [1, 0, 1], 'union' : [1, 2, 2]})
What's the most efficient way to do this? These are DataFrames with 30M rows (UserItem_df), and about 500k rows for the other two. The product set of all possible (user, item) pairs is about 30 billion. I don't need the intersections and unions for all possible pairs, just the ones in the UserItem dataframe.
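
To make the target concrete, here is a minimal sketch of one merge-based way to get these counts on the toy data. It is not benchmarked at the 30M-row scale and may not be the most efficient option. It assumes each (user, tag) and (item, tag) row should count once (duplicates are dropped), and the helper names n_user_tags, n_item_tags and matched are made up for illustration. The union is derived as |U| + |I| - |U ∩ I|, so only the intersection needs a join on tags.

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

# Assumption: a tag counts once per user / per item, so drop duplicate rows.
user_tags = UserTag_df.drop_duplicates()
item_tags = ItemTag_df.drop_duplicates()

# Number of distinct tags per user and per item (hypothetical column names).
n_user = user_tags.groupby('user').size().reset_index(name='n_user_tags')
n_item = item_tags.groupby('item').size().reset_index(name='n_item_tags')

# Intersection: expand each pair to its user's tags, then keep only the
# rows whose (item, tag) combination also exists in the item tags.
matched = (UserItem_df.drop_duplicates()
           .merge(user_tags, on='user')
           .merge(item_tags, on=['item', 'tag']))
inter = (matched.groupby(['user', 'item']).size()
         .reset_index(name='intersection'))

# Attach the counts to the original pairs; pairs with no match get NaN.
result = (UserItem_df
          .merge(inter, on=['user', 'item'], how='left')
          .merge(n_user, on='user', how='left')
          .merge(n_item, on='item', how='left'))

count_cols = ['intersection', 'n_user_tags', 'n_item_tags']
result[count_cols] = result[count_cols].fillna(0).astype(int)

# |U ∪ I| = |U| + |I| - |U ∩ I|
result['union'] = (result['n_user_tags'] + result['n_item_tags']
                   - result['intersection'])
result = result[['user', 'item', 'intersection', 'union']]
print(result)
```

The intermediate frame here has one row per (user, item, user tag), so its size is roughly the 30M pairs times the average number of tags per user; whether that fits in memory depends on how the tags are distributed.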