What's the best way to perform an operation on a dataframe when, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe holds the similarity between every pair of items. To start, I'll set every similarity to zero and calculate the correct values later.
import pandas as pd
import numpy as np
from scipy.spatial import distance

items = [1, 2, 3, 4]
item_item_idx = pd.MultiIndex.from_product([items, items], names=['from_item', 'to_item'])
item_item_df = pd.DataFrame(
    {'similarity': np.zeros(len(item_item_idx))},
    index=item_item_idx,
)
My next dataframe has the rating every user gave to every item. For the sake of simplicity, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1, 2, 3, 4, 5]
ratings_idx = pd.MultiIndex.from_product([items, users], names=['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low=1, high=6, size=len(users) * len(items))},
    columns=['rating'],
    index=ratings_idx,
)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each of the two items and calculate the cosine distance between them.
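To make the per-row computation concrete, here is a minimal sketch for a single pair of items. The two rating vectors are made-up numbers, not taken from rating_df. Note that scipy's distance.cosine returns the cosine distance, which is 1 minus the cosine similarity.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical ratings from five users for two items (made-up numbers).
v = np.array([5, 3, 4, 1, 2], dtype=float)
u = np.array([4, 3, 5, 1, 1], dtype=float)

# Cosine similarity: dot product divided by the product of the norms.
cos_sim = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))

# scipy's distance.cosine returns the cosine *distance*, i.e. 1 - similarity.
cos_dist = distance.cosine(v, u)

print(cos_sim, cos_dist)
```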
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
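def similarity(ii, iu):
    ratings = iu['rating']  # 1-D ratings, so distance.cosine gets plain vectors
    for index, _ in ii.iterrows():
        v = ratings.loc[index[0]]
        u = ratings.loc[index[1]]
        # iterrows() yields copies of the rows, so write back through .loc
        ii.loc[index, 'similarity'] = distance.cosine(v, u)
    return ii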
import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
This took 0.01002s to run. For a problem with 10k items, I estimate it would take in the ballpark of 20 hours. Not good.
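The estimate is a simple extrapolation from the measured time, spelled out below. It assumes the cost per row stays constant as the number of items grows, which is only a rough assumption.

```python
# Back-of-the-envelope extrapolation of the timing above.
measured_seconds = 0.01002      # time measured for the 4-item example
measured_pairs = 4 ** 2         # 16 rows in item_item_df
target_pairs = 10_000 ** 2      # rows needed for 10k items

# Assumes a constant cost per row (a rough assumption, not a measurement).
estimated_hours = measured_seconds / measured_pairs * target_pairs / 3600
print(f"{estimated_hours:.1f} hours")  # prints "17.4 hours"
```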
The thing is, I'm iterating over rows; my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've managed so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
    # Select the 'rating' column so distance.cosine receives 1-D vectors
    v = rating_df['rating'].loc[idx[0]]
    u = rating_df['rating'].loc[idx[1]]
    return distance.cosine(v, u)
start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961s to execute. Slower than just iterating over rows.
So this was a naive attempt to vectorize. Is it even possible? What other options do I have to improve the runtime?
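For reference, one direction that might avoid the per-row lookups entirely is sketched below. It assumes every item has a rating from every user, as in the example above: the ratings are pivoted into an item-by-user matrix and all pairwise cosine distances are computed in one call to scipy's distance.cdist. The name rating_matrix is introduced here for illustration, and the resulting values are cosine distances (1 minus similarity), the same quantity distance.cosine returns.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

items = [1, 2, 3, 4]
users = [1, 2, 3, 4, 5]
ratings_idx = pd.MultiIndex.from_product([items, users], names=['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low=1, high=6, size=len(ratings_idx))},
    index=ratings_idx,
)

# Pivot to one row per item and one column per user
# (assumes a complete item x user grid, so there are no missing values).
rating_matrix = rating_df['rating'].unstack('user')
values = rating_matrix.to_numpy()

# All pairwise cosine distances in a single call: an items x items array.
dist = distance.cdist(values, values, metric='cosine')

# Flatten back to the long (from_item, to_item) layout of item_item_df.
# ravel() walks the array row by row, matching the order of from_product.
pair_idx = pd.MultiIndex.from_product(
    [rating_matrix.index, rating_matrix.index], names=['from_item', 'to_item']
)
item_item_df = pd.DataFrame({'similarity': dist.ravel()}, index=pair_idx)
print(item_item_df.head())
```

This only works when the full item-by-user matrix fits in memory; with sparse ratings, missing entries would need to be handled before calling cdist.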
Thanks for the attention.