
I have a pandas DataFrame (say df) of shape (70000, 10). The head of the DataFrame is shown below:

                          0_x       1_x       2_x  ...       7_x       8_x       9_x
userid                                             ...                              
1000010249674395648  0.000007  0.999936  0.000007  ...  0.000007  0.000007  0.000007
1000282310388932608  0.000060  0.816790  0.000060  ...  0.000060  0.000060  0.000060
1000290654755450880  0.000050  0.000050  0.000050  ...  0.000050  0.191159  0.000050
1000304603840241665  0.993157  0.006766  0.000010  ...  0.000010  0.000010  0.000010
1000600081165438977  0.000064  0.970428  0.000064  ...  0.000064  0.000064  0.000064 

I would like to find the pairwise cosine distances between userids. For example:

cosine_distance(1000010249674395648, 1000282310388932608) = 0.9758776214797362
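
By cosine distance I mean the standard 1 minus cosine similarity; a minimal reference implementation for two row vectors u and v would be:

    import numpy as np

    def cosine_distance(u, v):
        # 1 - (u . v) / (||u|| * ||v||)
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))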

I have tried the following approaches, but both throw an out-of-memory error while computing the cosine distances: the full 70000 x 70000 result matrix alone takes about 39 GB in float64 (70000^2 * 8 bytes), far more than the available 16 GB of RAM:

  1. scikit-learn's cosine_similarity:

    from sklearn.metrics.pairwise import cosine_similarity
    cosine_sim = cosine_similarity(df)
    
  2. A faster vectorized solution found online:

    import numpy as np
    import pandas as pd

    def get_cosine_sim_df(df):
        # L2-normalize each row, then multiply the normalized matrix
        # by its transpose to get all pairwise cosine similarities
        topic_vectors = df.values
        norm_topic_vectors = topic_vectors / np.linalg.norm(topic_vectors, axis=-1)[:, np.newaxis]
        cosine_sim = np.dot(norm_topic_vectors, norm_topic_vectors.T)
        return pd.DataFrame(data=cosine_sim, index=df.index, columns=df.index)

    cosine_sim = get_cosine_sim_df(df)
    

System Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro11,4
  Processor Name: Quad-Core Intel Core i7
  Processor Speed: 2.2 GHz
  Number of Processors: 1
  Total Number of Cores: 4
  L2 Cache (per Core): 256 KB
  L3 Cache: 6 MB
  Hyper-Threading Technology: Enabled
  Memory: 16 GB

I'm looking for an efficient, memory-bounded way to calculate the pairwise cosine distances, something like PySpark DataFrames or a pandas batch-processing technique that works on chunks of the DataFrame rather than on all of it at once (the sketch below shows the kind of batching I mean).
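
To make that concrete, here is a minimal sketch of such a chunked computation; the chunk size, the float32 dtype, and the memory-mapped output file are illustrative assumptions on my part:

    import numpy as np

    def cosine_sim_chunked(X, chunk_size=1000, out_path="cosine_sim.npy"):
        # Compute pairwise cosine similarities one row-chunk at a time,
        # writing each chunk into a memory-mapped .npy file instead of
        # holding the full n x n result in RAM.
        X = np.asarray(X, dtype=np.float32)               # float32 halves the memory
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize the rows
        n = X.shape[0]
        out = np.lib.format.open_memmap(out_path, mode="w+",
                                        dtype=np.float32, shape=(n, n))
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            out[start:stop] = X[start:stop] @ X.T         # one row-chunk at a time
        out.flush()
        return out

The result still needs roughly 19 GB on disk for 70000 users in float32, but peak RAM stays around one chunk (chunk_size x 70000 floats, about 280 MB here).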

Any suggestions/approaches are appreciated.

FYI - I'm using Python 3.7

1 Answer


I am using Spark 2.4 and Python 3.7.

# build spark session ("local[*]" instead of "local" would use all cores)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .master("local") \
                    .appName("cos_sim") \
                    .getOrCreate()

Convert your pandas df to a Spark df:

# Pandas to Spark (using the `spark` session created above)
df = spark.createDataFrame(pand_df)
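
One caveat: createDataFrame drops the pandas index, and in the question the userids are the index. A minimal sketch, assuming your original frame is pand_df, that keeps them as a column first:

# move the userid index into an ordinary column so it survives
# the pandas -> Spark conversion (pand_df is the questioner's frame)
pand_df = pand_df.reset_index()          # index becomes a 'userid' column
df = spark.createDataFrame(pand_df)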

I generated some random data instead:

import random
from pyspark.sql.functions import monotonically_increasing_id

def generate_random_data(num_usrs=20, num_cols=10):
    # build column names like '0_x', '1_x', ... and fill them with
    # uniform random values, one row per user
    cols = [str(i) + "_x" for i in range(num_cols)]
    usrsdata = [[random.random() for _ in range(num_cols)] for _ in range(num_usrs)]
    return spark.createDataFrame(data=usrsdata, schema=cols)

df = generate_random_data()
df = df.withColumn("uid", monotonically_increasing_id())
df.limit(5).toPandas()   # just for a nice display of df (df is not actually changed)

(screenshot: the first rows of the random Spark DataFrame with the uid column)

Combine the feature columns of df into a single features vector:

from pyspark.ml.feature import VectorAssembler

# exclude the id column, otherwise it would be folded into the features
feature_cols = [c for c in df.columns if c != "uid"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df).select(['uid', 'features'])
assembled.limit(2).toPandas()

(screenshot: uid alongside the assembled features vector)

Normalize each features vector to unit length:

from pyspark.ml.feature import Normalizer

# p=2.0 (the default) gives L2 normalization, which the cosine
# computation below relies on
normalizer = Normalizer(inputCol="features", outputCol="norm", p=2.0)
data = normalizer.transform(assembled)
data.limit(2).toPandas()

(screenshot: the normalized features in the norm column)

Calculate pairwise cosine similarities

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# the rows are unit-normalized, so multiplying the matrix by its own
# transpose yields exactly the pairwise cosine similarities
mat = IndexedRowMatrix(data.select("uid", "norm").rdd
        .map(lambda row: IndexedRow(row.uid, row.norm.toArray()))).toBlockMatrix()
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()[:2]   # displaying the first 2 users only
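
Note that dot holds cosine similarities, while the question asks for distances (1 minus the similarity), and toLocalMatrix() collects everything onto the driver, which for 70000 users hits the same memory wall as before. A sketch of my own (not part of the original answer) that stays distributed:

# convert the BlockMatrix product to entry form and map each
# similarity to a distance without collecting the full matrix
entries = dot.toCoordinateMatrix().entries       # RDD of MatrixEntry(i, j, value)
distances = entries.map(lambda e: (e.i, e.j, 1.0 - e.value))
distances.take(3)                                # inspect a few (i, j, distance) triples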


References: Calculating the cosine similarity between all the rows of a dataframe in pyspark
