
Even though I found a few threads dealing with distance-matrix efficiency, they all work with plain int or float matrices. In my case I have to deal with vectors (an OrderedDict of frequencies per row), and I only end up with a very slow method that is not viable for a large DataFrame (300,000 x 300,000).

How can I make this process more efficient?

I would be very thankful for any help, this problem has been killing me :)

Considering DataFrame df such as:

>>> df
    vectors
id
1   {dict1}
2   {dict2}
3   {dict3}
4   {dict4}

where each {dict#} is an OrderedDict of event frequencies:

OrderedDict({event1: 1,
             event2: 5,
             event3: 0,
             ...})

A function to return the distance between two vectors:

def vectorDistance(a, b, df_vector):
    # Calculate distance between a & b
    # based on the vector from df_vector.
    return distance

[in]: vectorDistance({dict1}, {dict2}, df)

[out]: distance

A desired Output:

      1     2      3      4 
id
1     0   1<->2  1<->3  1<->4
2   1<->2   0     ...    ...
3   1<->3  ...     0     ...
4   1<->4  ...    ...     0

(where 1<->2 is a float distance between vector 1 & 2)

Method used:

import pandas as pd

matrix = pd.concat([df, df.T], axis=1)

for index in matrix.index:
    for col in matrix.columns:
        # .ix is deprecated (removed in pandas 1.0); use .loc instead.
        matrix.loc[col, index] = vectorDistance(col, index, df)

>>> matrix
          5072142538    5072134420  4716823618   ...
udid            
5072142538  0.00000      0.01501       0.06002   ...
5072134420  0.01501      0.00000       0.09037   ...
4716823618  0.06002      0.09037       0.00000   ...
    ...        ...          ...          ...

EDIT:

Minimal example

Note: The events can differ from one {dict} to another, but that is handled inside the function. My issue is rather how to pass the right a & b pairs to fill each cell quickly.

I am working with cosine distance as it's rather good with vectors such as mine.

from collections import Counter
from math import sqrt

import numpy as np
import pandas as pd


raw_data = {'counters_': {4716823618: Counter({51811: 1, 51820: 1, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 8, 51853: 5, 51854: 4, 51856: 24, 51903: 11, 51904: 12, 51905: 3, 51906: 19, 51908: 230, 51922: 24, 51927: 19, 51931: 2, 106282: 9, 112830: 1, 119453: 1, 165062: 80, 168904: 3, 180354: 19, 180437: 33, 185824: 117, 186171: 14, 187101: 1, 190827: 7, 201629: 1, 209318: 37}), 5072134420: Counter({51811: 1, 51812: 1, 51820: 1, 51833: 56, 51835: 9, 51843: 49, 51848: 2, 51852: 11, 51853: 4, 51854: 4, 51856: 28, 51885: 1, 51903: 17, 51904: 17, 51905: 9, 51906: 14, 51908: 225, 51927: 29, 51931: 2, 106282: 19, 112830: 2, 168904: 9, 180354: 14, 185824: 219, 186171: 7, 187101: 1, 190827: 6, 201629: 2, 209318: 41}), 5072142538: Counter({51811: 4, 51812: 4, 51820: 4, 51833: 56, 51835: 8, 51843: 48, 51848: 2, 51852: 6, 51853: 3, 51854: 3, 51856: 18, 51885: 1, 51903: 17, 51904: 16, 51905: 3, 51906: 24, 51908: 258, 51927: 20, 51931: 8, 106282: 16, 112830: 2, 168904: 3, 180354: 24, 185824: 180, 186171: 10, 187101: 1, 190827: 7, 201629: 2, 209318: 52})}}


def vectorDistance(index, col):
    a = dict(df[df.index == index]["counters_"].values[0])
    b = dict(df[df.index == col]["counters_"].values[0])
    return abs(np.round(1-(similarity(a,b)),5))

def scalar(collection): 
  total = 0 
  for coin, count in collection.items(): 
    total += count * count 
  return sqrt(total) 

def similarity(A,B): 
  total = 0 
  for kind in A:
    if kind in B: 
      total += A[kind] * B[kind] 
  return float(total) / (scalar(A) * scalar(B))

df = pd.DataFrame(raw_data)
matrix = pd.concat([df, df.T], axis=1)
matrix.drop("counters_", axis=0, inplace=True)
matrix.drop("counters_", axis=1, inplace=True)

for index in matrix.index:
    for col in matrix.columns:
        # .ix is deprecated (removed in pandas 1.0); use .loc instead.
        matrix.loc[col, index] = vectorDistance(col, index)


matrix
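For what it's worth, the two helpers above can be sanity-checked in isolation; a minimal sketch with made-up event ids and counts chosen for easy arithmetic:

```python
from math import sqrt

# Stand-alone copies of the question's two helpers.
def scalar(collection):
    # Euclidean norm of a frequency dict.
    return sqrt(sum(count * count for count in collection.values()))

def similarity(A, B):
    # Cosine similarity: dot product over the product of the norms.
    total = sum(A[kind] * B[kind] for kind in A if kind in B)
    return float(total) / (scalar(A) * scalar(B))

a = {1: 3, 2: 4}           # norm 5
b = {1: 3, 2: 4, 3: 12}    # norm 13
print(similarity(a, a))    # identical vectors -> 1.0
print(similarity(a, b))    # dot = 9 + 16 = 25, so 25 / (5 * 13)
```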
maxymoo
    Does each dictionary have the same events, or can they differ? You may also need to provide details of your vectorDistance function so that others can replicate results. – Alexander Dec 06 '15 at 21:26
  • Hi @Alexander, sorry I pressed enter by accident, I added the details you needed in the question as it was long :) – Benjamin Devienne Dec 06 '15 at 21:38
  • Roughly how many unique events are there? Just wondering if it is feasible to calculate distance between each pair and do a lookup. – Alexander Dec 06 '15 at 21:42
  • About 1000, but let me do a minimal example file with real data, I'll add the link to the question, it should not be long. – Benjamin Devienne Dec 06 '15 at 21:45
  • @Alexander, there you go, I added the minimal example with real values. Looking forward to read you if you can find a solution. – Benjamin Devienne Dec 06 '15 at 21:58
  • take a look at scipy `pdist`: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html – maxymoo Dec 06 '15 at 22:04
  • also http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat – maxymoo Dec 06 '15 at 22:05
  • Thanks for the links, @maxymoo, unfortunately this is what I was saying in my intro, those approaches work like a charm when you have non complex data such as int or float, but in the case of vectors inside the dataframe, things get pretty messy. I tried those ideas and failed, so maybe I am just missing how to apply them in my situation :( – Benjamin Devienne Dec 06 '15 at 22:08

2 Answers


You don't want to store dicts inside your dataframe. Read in your dataframe using the from_dict method:

df = pd.DataFrame.from_dict(raw_data['counters_'], orient='index')

Then you can apply the numpy/scipy vectorised methods for computing cosine similarity as in What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
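For illustration, a hedged sketch of that approach using scipy's `pdist` on a tiny stand-in for `raw_data['counters_']` (the ids and events below are made up, not from the question):

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Small stand-in for raw_data['counters_'] (hypothetical ids and events).
counters = {
    1: {'a': 1, 'b': 2},
    2: {'a': 2, 'b': 4},   # parallel to id 1 -> cosine distance 0
    3: {'b': 1, 'c': 3},
}

# One row per id, one column per event; absent events become 0.
df = pd.DataFrame.from_dict(counters, orient='index').fillna(0)

# pdist computes each pairwise distance once; squareform expands the
# condensed result into the full symmetric matrix.
m = pd.DataFrame(squareform(pdist(df.values, metric='cosine')),
                 index=df.index, columns=df.index)
print(m)
```

Because `pdist` is implemented in C and only visits each pair once, this avoids the Python-level double loop entirely.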

maxymoo
  • Hello @maxymoo, I see what you're suggesting, it's a good idea, however the thread you point out was unsolved or at least unclear to me, would you mind based on the minimal example add an illustration on how to do it ? I would really appreciate :) – Benjamin Devienne Dec 06 '15 at 23:01

Restructuring the data into a plain numeric DataFrame is far more efficient and easier to read than looping over dicts stored cell by cell.

import numpy as np
import pandas as pd

df = pd.DataFrame([v for v in raw_data['counters_'].values()],
                  index=raw_data['counters_'].keys()).T

>>> df.head()
       4716823618  5072134420  5072142538
51811           1           1           4
51812         NaN           1           4
51820           1           1           4
51833          56          56          56
51835           8           9           8

# raw_data no longer needed.  Delete to reduce memory footprint.
del raw_data  

# Create scalars.
scalars = ((df ** 2).sum()) ** .5

>>> scalars
4716823618    289.679133
5072134420    330.548030
5072142538    331.957829
dtype: float64

def v_dist(col_1, col_2):
    return 1 - ((df.iloc[:, col_1] * df.iloc[:, col_2]).sum() / 
                (scalars.iloc[col_1] * scalars.iloc[col_2]))

>>> v_dist(0, 1)
0.09036665882900885

>>> v_dist(0, 2)
0.060016436804916085

>>> v_dist(1, 2)
0.015009898476505357

m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)

>>> m
            4716823618  5072134420  5072142538
4716823618         NaN         NaN         NaN
5072134420         NaN         NaN         NaN
5072142538         NaN         NaN         NaN

for row in range(m.shape[0]):
    for col in range(row, m.shape[1]):  # Note: m.shape[0] equals m.shape[1]
        if row == col:
            # No need to calculate value for diagonal.
            m.iat[row, col] = 0
        else:
            # Fill both symmetric cells with one calculation.
            m.iat[row, col] = m.iat[col, row] = v_dist(row, col)

>>> m
            4716823618  5072134420  5072142538
4716823618    0.000000    0.090367    0.060016
5072134420    0.090367    0.000000    0.015010
5072142538    0.060016    0.015010    0.000000

Wrapping all of this into a function:

def calc_matrix(raw_data):
    df = pd.DataFrame([v for v in raw_data['counters_'].values()], 
                      index=raw_data['counters_'].keys()).T
    scalars = ((df ** 2).sum()) ** .5
    m = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)
    for row in range(m.shape[0]):
        for col in range(row, m.shape[1]):
            if row == col:
                m.iat[row, col] = 0
            else:
                m.iat[row, col] = m.iat[col, row] = (1 -
                    (df.iloc[:, row] * df.iloc[:, col]).sum() /
                    (scalars.iloc[row] * scalars.iloc[col]))
    return m
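The pairwise loop can also be lifted out entirely; a sketch of a fully vectorised variant using one matrix product (the function name and the toy frame below are mine, not from the answer, and it assumes the same layout: one column per id, one row per event):

```python
import numpy as np
import pandas as pd

def calc_matrix_vectorised(df):
    # Cosine-distance matrix with no Python-level loops.
    X = df.fillna(0).values
    norms = np.sqrt((X ** 2).sum(axis=0))       # one norm per column (id)
    cos = (X.T @ X) / np.outer(norms, norms)    # full cosine-similarity matrix
    return pd.DataFrame(1 - cos, index=df.columns, columns=df.columns)

# Toy frame with the same shape as df above (made-up ids and counts).
toy = pd.DataFrame({'A': [1, 2, 0], 'B': [2, 4, 0], 'C': [0, 1, 3]})
print(calc_matrix_vectorised(toy))
```

A single `X.T @ X` hands the whole pairwise computation to BLAS, which is what makes this approach feasible at the scale in the question (subject to memory for the n x n result).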
Alexander