
What's the best way to perform an operation on a dataframe where, for every row, I need to do a selection on another dataframe?

For example:

My first dataframe has the similarity between every pair of items. For starters, I'll set every similarity to zero and calculate the correct similarities later.

import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance

items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))}, 
                            index = item_item_idx
                            )

My next dataframe has the rating every user gave for every item. For the sake of simplicity, let's assume every user rated every item and generate random ratings between 1 and 5.

users = [1,2,3,4,5]

ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
    {'rating': np.random.randint(low = 1, high =  6, size = len(users)*len(items))},
    columns = ['rating'],
    index = ratings_idx
    )

Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each of the two items and calculate the cosine distance between them.
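For a single pair of items, the computation I have in mind looks roughly like this (a sketch using the toy data above; items 1 and 2 are arbitrary examples):

# Sketch: cosine distance between the rating vectors of two items.
# Items 1 and 2 are arbitrary examples from the toy data above.
v = rating_df.loc[1, 'rating'].to_numpy()   # ratings all users gave to item 1
u = rating_df.loc[2, 'rating'].to_numpy()   # ratings all users gave to item 2
d = distance.cosine(v, u)                   # 1 - cosine similarity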

I want to know the least dumb way to do this. Here's what I tried so far:

==== FIRST TRY - Iterating over rows

def similarity(ii, iu):
    
    for index, row in ii.iterrows():
        
        # index is the (from_item, to_item) pair; grab each item's rating vector
        v = iu.loc[index[0]]
        u = iu.loc[index[1]]
        
        # write back through .loc -- assigning to `row` only changes a copy
        ii.loc[index, 'similarity'] = distance.cosine(v, u)
    
    return ii


import time

start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))

Took me 0.01002s to run this. For a problem with 10k items, I estimate it would take in the ballpark of 20 hours to run. Not good.

The thing is, since I'm iterating over rows, my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've managed so far:

==== SECOND TRY - index.map()

def similarity_map(idx):
    
    # idx is the (from_item, to_item) pair from the MultiIndex
    v = rating_df.loc[idx[0]]
    u = rating_df.loc[idx[1]]
    
    return distance.cosine(v, u)

start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))

Took me 0.034961s to execute. Slower than just iterating over rows.

So this was a naive attempt to vectorize. Is it even possible to do? What other options do I have to improve the runtime?

Thanks for your attention.

  • The **second** answer to this question might be helpful: [How to iterate over rows in a DataFrame in Pandas](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas). – Timus Nov 03 '20 at 19:55

2 Answers


For your given example I'd just pivot it into an array and move on with my life.

from sklearn.metrics.pairwise import cosine_similarity

# reshape to one row per item, one column per user's rating
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
                     index=rating_df.index, columns=rating_df.index)

>>> cs_df
item         1         2         3         4
item                                        
1     1.000000  0.877346  0.660529  0.837611
2     0.877346  1.000000  0.608781  0.852029
3     0.660529  0.608781  1.000000  0.758098
4     0.837611  0.852029  0.758098  1.000000
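
If you want that back in your original item_item_df, you can flatten the square matrix onto the MultiIndex (a sketch; this assumes both frames enumerate the items in the same sorted order):

# cs_df and item_item_df both enumerate (from_item, to_item) pairs in the
# same row-major order here, so a flat copy lines up with the MultiIndex.
item_item_df['similarity'] = cs_df.to_numpy().flatten()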

This would be more difficult with a giant, highly sparse array. sklearn's cosine_similarity accepts sparse arrays though, so as long as your number of items is reasonable (since the output matrix will be dense) this should be solvable.
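As a rough sketch of the sparse route (the long-format frame ratings_long and its column names are illustrative, e.g. something like the original rating_df.reset_index()):

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical long-format ratings table with columns 'item', 'user', 'rating'.
rows = ratings_long['item'].astype('category').cat.codes.to_numpy()
cols = ratings_long['user'].astype('category').cat.codes.to_numpy()
sparse_ratings = csr_matrix((ratings_long['rating'].to_numpy(), (rows, cols)))

# cosine_similarity accepts sparse input; the result is a dense (n_items, n_items) array.
item_sims = cosine_similarity(sparse_ratings)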

CJR

Same thing but different: work with numpy arrays directly. Fine for small arrays, but with 10k rows you'll end up with some large arrays.

import numpy as np

data = rating_df.unstack().values        # ratings as an (n_items, n_users) array, here (4, 5)
udotv = np.dot(data, data.T)             # pairwise dot products, shape (4, 4)
mag_data = np.linalg.norm(data, axis=1)
mag = mag_data * mag_data[:, None]       # outer product of the vector norms
cos_sim = 1 - (udotv / mag)              # cosine distance (1 - similarity), matching distance.cosine
item_item_df['sim2'] = cos_sim.flatten()

4k users and 14k items pretty much blows up my poor computer. I'm going to have to look at how sklearn.metrics.pairwise.cosine_similarity handles data that large.
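
One way I'd sketch keeping peak memory down at that scale (not tested on data that size; block_size is an arbitrary illustrative value): normalize the rows once, then fill the similarity matrix block by block.

import numpy as np

def chunked_cosine_sim(data, block_size=1000):
    # data: dense (n_items, n_users) array of ratings
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    unit = data / norms                        # unit-length rows
    n = unit.shape[0]
    # the output is still n x n (float32 halves its footprint),
    # but the per-block intermediates stay small
    sim = np.empty((n, n), dtype=np.float32)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sim[start:stop] = unit[start:stop] @ unit.T
    return sim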

wwii