
I have a pandas DataFrame (say df) of shape (70000, 10). The head of the DataFrame is shown below:

                          0_x       1_x       2_x  ...       7_x       8_x       9_x
userid                                             ...                              
1000010249674395648  0.000007  0.999936  0.000007  ...  0.000007  0.000007  0.000007
1000282310388932608  0.000060  0.816790  0.000060  ...  0.000060  0.000060  0.000060
1000290654755450880  0.000050  0.000050  0.000050  ...  0.000050  0.191159  0.000050
1000304603840241665  0.993157  0.006766  0.000010  ...  0.000010  0.000010  0.000010
1000600081165438977  0.000064  0.970428  0.000064  ...  0.000064  0.000064  0.000064 

I would like to find the pairwise cosine distances between userids. For example:

cosine_distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362
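
By cosine distance I mean the standard 1 minus cosine similarity; a minimal reference implementation for two row vectors u and v would be:

    import numpy as np

    def cosine_distance(u, v):
        # 1 - (u . v) / (||u|| * ||v||)
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))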

I have tried the following approaches, but both throw an out-of-memory error while computing the cosine distances: the full 70000 x 70000 result matrix alone takes about 39 GB in float64 (70000^2 * 8 bytes), far more than the available 16 GB of RAM:

  1. scikit-learn's cosine_similarity:

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_sim = cosine_similarity(df)
    
  2. A faster vectorized solution found online:

    import numpy as np
    import pandas as pd

    def get_cosine_sim_df(df):
        # L2-normalize each row, then multiply the normalized matrix
        # by its transpose to get all pairwise cosine similarities
        topic_vectors = df.values
        norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
        cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
        return pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)

    cosine_sim = get_cosine_sim_df(df)
    

System Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro11,4
  Processor Name: Quad-Core Intel Core i7
  Processor Speed: 2.2 GHz
  Number of Processors: 1
  Total Number of Cores: 4
  L2 Cache (per Core): 256 KB
  L3 Cache: 6 MB
  Hyper-Threading Technology: Enabled
  Memory: 16 GB

I'm looking for an efficient, memory-bounded way to calculate the pairwise cosine distances, something like PySpark DataFrames or a pandas batch-processing technique that works on chunks of the DataFrame rather than on all of it at once (the sketch below shows the kind of batching I mean).
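
To make that concrete, here is a minimal sketch of such a chunked computation; the chunk size, the float32 dtype, and the memory-mapped output file are illustrative assumptions on my part:

    import numpy as np

    def cosine_sim_chunked(X, chunk_size=1000, out_path="cosine_sim.npy"):
        # Compute pairwise cosine similarities one row-chunk at a time,
        # writing each chunk into a memory-mapped .npy file instead of
        # holding the full n x n result in RAM.
        X = np.asarray(X, dtype=np.float32)               # float32 halves the memory
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize the rows
        n = X.shape[0]
        out = np.lib.format.open_memmap(out_path, mode="w+",
                                        dtype=np.float32, shape=(n, n))
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            out[start:stop] = X[start:stop] @ X.T         # one row-chunk at a time
        out.flush()
        return out

The result still needs roughly 19 GB on disk for 70000 users in float32, but peak RAM stays around one chunk (chunk_size x 70000 floats, about 280 MB here).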

Any suggestions/approaches are appreciated.

FYI - I'm using Python 3.7

1 Answer


I am using Spark 2.4 and Python 3.7.

# build spark session ("local[*]" instead of "local" would use all cores)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .master("local") \
                    .appName("cos_sim") \
                    .getOrCreate()

Convert your pandas df to a Spark df:

# Pandas to Spark (using the `spark` session created above)
df = spark.createDataFrame(pand_df)
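
One caveat: createDataFrame drops the pandas index, and in the question the userids are the index. A minimal sketch, assuming your original frame is pand_df, that keeps them as a column first:

# move the userid index into an ordinary column so it survives
# the pandas -> Spark conversion (pand_df is the questioner's frame)
pand_df = pand_df.reset_index()          # index becomes a 'userid' column
df = spark.createDataFrame(pand_df)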

I generated some random data instead:

import random
from pyspark.sql.functions import monotonically_increasing_id

def generate_random_data(num_usrs=20, num_cols=10):
    # build column names like '0_x', '1_x', ... and fill them with
    # uniform random values, one row per user
    cols = [str(i) + "_x" for i in range(num_cols)]
    usrsdata = [[random.random() for _ in range(num_cols)] for _ in range(num_usrs)]
    return spark.createDataFrame(data=usrsdata, schema=cols)

df = generate_random_data()
df = df.withColumn("uid", monotonically_increasing_id())
df.limit(5).toPandas()   # just for a nice display of df (df is not actually changed)

(screenshot: the first rows of the random Spark DataFrame with the uid column)

Combine the feature columns of df into a single features vector:

from pyspark.ml.feature import VectorAssembler

# exclude the id column, otherwise it would be folded into the features
feature_cols = [c for c in df.columns if c != "uid"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df).select(['uid', 'features'])
assembled.limit(2).toPandas()

(screenshot: uid alongside the assembled features vector)

Normalize each features vector to unit length:

from pyspark.ml.feature import Normalizer

# p=2.0 (the default) gives L2 normalization, which the cosine
# computation below relies on
normalizer = Normalizer(inputCol="features", outputCol="norm", p=2.0)
data = normalizer.transform(assembled)
data.limit(2).toPandas()

(screenshot: the normalized features in the norm column)

Calculate pairwise cosine similarities

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# the rows are unit-normalized, so multiplying the matrix by its own
# transpose yields exactly the pairwise cosine similarities
mat = IndexedRowMatrix(data.select("uid", "norm").rdd
        .map(lambda row: IndexedRow(row.uid, row.norm.toArray()))).toBlockMatrix()
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()[:2]   # displaying the first 2 users only
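
Note that dot holds cosine similarities, while the question asks for distances (1 minus the similarity), and toLocalMatrix() collects everything onto the driver, which for 70000 users hits the same memory wall as before. A sketch of my own (not part of the original answer) that stays distributed:

# convert the BlockMatrix product to entry form and map each
# similarity to a distance without collecting the full matrix
entries = dot.toCoordinateMatrix().entries       # RDD of MatrixEntry(i, j, value)
distances = entries.map(lambda e: (e.i, e.j, 1.0 - e.value))
distances.take(3)                                # inspect a few (i, j, distance) triples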


References: Calculating the cosine similarity between all the rows of a dataframe in pyspark
