I currently have a DataFrame dta of over 100,000 rows and over 100 columns, where each cell dta[k, i] is a list of elements. My goal is to compute a symmetric table a where a[i, j] = mean over all rows k of len(intersect(dta[k, i], dta[k, j])), i.e., for each pair of columns, compute the size of the intersection in every row and then take the mean over all rows.
Some simple code to create an example would be
import numpy as np
import pandas as pd

dta = pd.DataFrame(
    {
        "a": ["a", "b", "a", "a", "b", "a"],
        "b": ["a", "b", "a", "a", "b", "a"],
        "c": ["a", "ee", "c", "a", "b", "a"],
        "d": ["aaa b", "bbb a", "ccc c", "a", "b", "a"],
    }
)
# split each cell into a list of tokens
dta = dta.applymap(lambda x: x.split())
table = pd.DataFrame(np.zeros((4, 4)))
for i in range(4):
    for j in range(i, 4):
        # mean intersection size between columns i and j, over all rows
        table.iloc[i, j] = dta.apply(
            lambda x: len(set(x.iloc[i]).intersection(x.iloc[j])), axis=1
        ).mean()
table
The example input is
     a    b     c         d
0  [a]  [a]   [a]  [aaa, b]
1  [b]  [b]  [ee]  [bbb, a]
2  [a]  [a]   [c]  [ccc, c]
3  [a]  [a]   [a]       [a]
4  [b]  [b]   [b]       [b]
5  [a]  [a]   [a]       [a]
and the output is
     0    1         2         3
0  1.0  1.0  0.666667  0.500000
1  0.0  1.0  0.666667  0.500000
2  0.0  0.0  1.000000  0.666667
3  0.0  0.0  0.000000  1.500000
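Only the upper triangle is filled, since the table is symmetric. Note also that the diagonal entry for column d is 1.5 rather than 1.0, because the lists in d can hold more than one token, so a self-intersection can have size greater than 1. As a quick check:

# self-intersection sizes for column "d" are [2, 2, 2, 1, 1, 1]
dta["d"].apply(lambda lst: len(set(lst))).mean()  # 1.5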
My current method is as follows:

def func(row, col1, col2) -> float:
    list1, list2 = row[col1], row[col2]
    return len(set(list1).intersection(list2))

colnames = dta.columns
col_num = len(colnames)
a = pd.DataFrame(np.zeros((col_num, col_num)))
for col_id, col in enumerate(colnames):
    for tgt_col_id in range(col_id, col_num):
        # mean intersection size between col and colnames[tgt_col_id]
        a.loc[col_id, tgt_col_id] = dta.apply(
            func, args=(col, colnames[tgt_col_id]), axis=1
        ).mean()
My idea is that I could probably speed this up by multiprocessing the column-pair loop, since the pair computations are independent of one another. But is there a numpy / pandas way to speed up the operation between two columns? Any idea to speed up the processing would be helpful!
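For reference, here is a minimal sketch of the multiprocessing idea (pair_mean is a hypothetical helper name; this assumes a fork-based start method, e.g. on Linux, so worker processes inherit the module-level dta):

from itertools import combinations_with_replacement
from multiprocessing import Pool

def pair_mean(pair):
    # mean intersection size for one column pair; dta is inherited from the parent process
    col1, col2 = pair
    value = dta.apply(
        lambda row: len(set(row[col1]).intersection(row[col2])), axis=1
    ).mean()
    return col1, col2, value

if __name__ == "__main__":
    pairs = list(combinations_with_replacement(dta.columns, 2))
    a = pd.DataFrame(
        np.zeros((len(dta.columns), len(dta.columns))),
        index=dta.columns, columns=dta.columns,
    )
    with Pool() as pool:
        for col1, col2, value in pool.map(pair_mean, pairs):
            a.loc[col1, col2] = value

Whether this actually beats the serial loop would depend on how the per-pair apply cost compares to the process start-up and pickling overhead.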