
I currently have a dataframe dta of over 100,000 rows and over 100 columns, where each cell dta[k, j] is a list of elements. My goal is to compute a symmetric table a where a[i, j] = mean over k of len(intersect(dta[k, i], dta[k, j])), i.e., for each pair of columns, compute the per-row intersection size and then take the mean over all rows.

A minimal example can be created with

import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a" : ["a", "b", "a", "a", "b", "a"],
        "b" : ["a", "b", "a", "a", "b", "a"],
        "c" : ["a", "ee", "c", "a", "b", "a"],
        "d" : ["aaa b", "bbb a", "ccc c", "a", "b", "a"]
    }
)
dta = dta.applymap(lambda x : x.split())
table = pd.DataFrame(np.zeros((4, 4)))
for i in range(4) :
    for j in range(i, 4) :
        table.iloc[i, j] = dta.apply(
            lambda x : len(set(x.iloc[i]).intersection(set(x.iloc[j]))), axis=1
        ).mean()
table

The example input is

     a    b    c     d
0  [a]  [a]  [a]   [aaa, b]
1  [b]  [b]  [ee]  [bbb, a]
2  [a]  [a]  [c]   [ccc, c]
3  [a]  [a]  [a]   [a]
4  [b]  [b]  [b]   [b]
5  [a]  [a]  [a]   [a]

and the output is

     0    1         2         3
0  1.0  1.0  0.666667  0.500000
1  0.0  1.0  0.666667  0.500000
2  0.0  0.0  1.000000  0.666667
3  0.0  0.0  0.000000  1.500000

My current method is as follows:

def func(row, col1, col2) -> float :
    list1, list2 = row[col1], row[col2]
    return len(set(list1).intersection(list2))

colnames = list(dta.columns)
col_num = len(colnames)
a = pd.DataFrame(np.zeros((col_num, col_num)))
for col_id, col in enumerate(colnames) :
    for tgt_col_id in range(col_id, col_num) :
        a.loc[col_id, tgt_col_id] = dta.apply(
            func, args=(col, colnames[tgt_col_id]), axis=1
        ).mean()

My idea is that I could probably speed this up with multiprocessing over the column pairs, since the pair operations are independent. But is there any numpy / pandas way to speed up the operation between two columns?

Any idea to speed up the processing would be helpful!

AlexK
exteral

1 Answer


Not sure how much of a speed boost this is going to provide with your full data, but this code eliminates the explicit Python loops.

from itertools import combinations_with_replacement
import pandas as pd
import numpy as np

dta = pd.DataFrame(
    {
        "a" : ["a", "b", "a", "a", "b", "a"],
        "b" : ["a", "b", "a","a", "b", "a"],
        "c" : ["a", "ee", "c","a", "b", "a"],
        "d" : ["aaa b", "bbb a", "ccc c","a", "b", "a"]
    }
)
dta = dta.applymap(lambda x : x.split() )

table = np.zeros((4, 4))
iter_arr = np.array(
    list(combinations_with_replacement(range(table.shape[1]), 2))
)
# code above creates array of column combinations:
# array([[0, 0],
#        [0, 1],
#        [0, 2],
#        [0, 3],
#        [1, 1],
#        [1, 2],
#        [1, 3],
#        [2, 2],
#        [2, 3],
#        [3, 3]])

def mean_of_set_intersects(x, y):
    # convert every cell (a list of tokens) to a set
    vset = np.vectorize(lambda a: set(a))
    x_set = vset(x)
    y_set = vset(y)

    # elementwise intersection size, then mean across the rows
    vlenint = np.vectorize(lambda a, b: len(a.intersection(b)))
    return np.mean(vlenint(x_set, y_set), axis=1)

table[iter_arr[:,0],iter_arr[:,1]] = (
    mean_of_set_intersects(
        dta.values.transpose()[iter_arr][:,0,:],
        dta.values.transpose()[iter_arr][:,1,:]
    )
)
pd.DataFrame(table)
#      0      1        2           3
# 0  1.0    1.0 0.666667    0.500000
# 1  0.0    1.0 0.666667    0.500000
# 2  0.0    0.0 1.000000    0.666667
# 3  0.0    0.0 0.000000    1.500000

This answer was used as a reference: https://stackoverflow.com/a/49821744/9987623
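One possible variation (my own sketch, not from the answer above): since every column participates in many pairs, the `set` conversion can be hoisted out so each cell is converted exactly once, and the pair loop then only measures intersections. This trades the fully index-based form for a small loop over pairs, but avoids rebuilding sets per pair.

```python
from itertools import combinations_with_replacement

import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
# build each cell's set exactly once, instead of once per pair
sets = dta.applymap(lambda s: set(s.split())).to_numpy().T  # (n_cols, n_rows)

n = sets.shape[0]
table = np.zeros((n, n))
vlenint = np.vectorize(lambda a, b: len(a & b))
for i, j in combinations_with_replacement(range(n), 2):
    # sets[i] and sets[j] are object arrays of precomputed sets
    table[i, j] = vlenint(sets[i], sets[j]).mean()
```

The loop here runs only n*(n+1)/2 times (about 5,000 iterations for 100 columns), so the bulk of the time is still the per-row intersection work inside `np.vectorize`, which is itself a Python-level loop rather than true vectorization.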
