I have a pandas dataframe (say df) of shape (70000, 10). The head of the dataframe is shown below:
0_x 1_x 2_x ... 7_x 8_x 9_x
userid ...
1000010249674395648 0.000007 0.999936 0.000007 ... 0.000007 0.000007 0.000007
1000282310388932608 0.000060 0.816790 0.000060 ... 0.000060 0.000060 0.000060
1000290654755450880 0.000050 0.000050 0.000050 ... 0.000050 0.191159 0.000050
1000304603840241665 0.993157 0.006766 0.000010 ... 0.000010 0.000010 0.000010
1000600081165438977 0.000064 0.970428 0.000064 ... 0.000064 0.000064 0.000064
I would like to find the pairwise cosine distances between userids. For example:
cosine_distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362
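For a single pair this is straightforward with scipy (note that scipy.spatial.distance.cosine returns the cosine distance, i.e. 1 minus the cosine similarity):

from scipy.spatial.distance import cosine

# Cosine distance between two userid rows (1 - cosine similarity).
d = cosine(df.loc[1000010249674395648], df.loc[1000282310388932608])

The problem is doing this for all ~2.45 billion unordered pairs at once.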
I have tried the following approaches, but both throw an out-of-memory error while computing the cosine distances: the full 70000 x 70000 result matrix alone needs about 39 GB in float64 (70000^2 * 8 bytes), far more than the available RAM.
scikit-learn's cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(df)
A faster vectorized solution found online:
import numpy as np
import pandas as pd

def get_cosine_sim_df(df):
    topic_vectors = df.values
    # Normalize each row to unit length so the dot product below yields cosine similarity.
    norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
    cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
    cosine_sim_df = pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)
    return cosine_sim_df

cosine_sim = get_cosine_sim_df(df)
System Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,4
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
I'm looking for a quicker, memory-efficient way to calculate the pairwise cosine distances within the CPU memory limit, something along the lines of pyspark dataframes or pandas batch-processing techniques, rather than processing the whole dataframe at once.
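To make the intent concrete, here is the kind of batched pattern I have in mind (a rough sketch only; the batch_size value and the idea of consuming one block of rows at a time are my assumptions):

import numpy as np
import pandas as pd

def cosine_sim_in_batches(df, batch_size=1000):
    # Normalize every row once up front so each batch reduces to a plain matrix product.
    vectors = df.to_numpy(dtype=np.float64)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    for start in range(0, unit.shape[0], batch_size):
        # Each block is (batch_size x 70000) ~ 0.5 GB in float64, which fits in RAM.
        # For cosine *distance* rather than similarity, use 1.0 - block.
        block = unit[start:start + batch_size] @ unit.T
        yield pd.DataFrame(block,
                           index=df.index[start:start + batch_size],
                           columns=df.index)

Each block would then be written to disk or reduced before the next one is computed, e.g.:

for sim_block in cosine_sim_in_batches(df):
    sim_block.to_csv('cosine_sim.csv', mode='a', header=False)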
Any suggestions/approaches are appreciated.
FYI - I'm using Python 3.7