I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')
df
     Snack  Trans  Dop
TFD
AA       1      1    1
SL       0      1    0
BB       1      1    1
D0       1      0    0
Dk       0      0    1
FF       0      1    1
By using this I can calculate the following co-occurrence matrix:
df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)
coocc
       Snack  Trans  Dop
Snack      3      2    2
Trans      2      4    3
Dop        2      3    4
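To make explicit what the dot product counts, here is a minimal sketch that recomputes two entries by hand (the names `both` and `snack_total` are introduced only for illustration). Each off-diagonal entry counts every row where both columns are 1, whatever the third column holds, and each diagonal entry is just the column total:

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

coocc = df.T.dot(df)

# Off-diagonal entry: rows where both columns are 1, regardless of other columns.
both = int(((df['Dop'] == 1) & (df['Trans'] == 1)).sum())

# Diagonal entry: the plain column total.
snack_total = int(df['Snack'].sum())

print(both, coocc.loc['Dop', 'Trans'])
print(snack_total, coocc.loc['Snack', 'Snack'])
```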
However, I want the occurrences not to overlap. What I mean is this:
In the original df there is only 1 TFD that has only Snack, so the [Snack, Snack] value in the coocc table should be 1.
Moreover, [Dop, Trans] should be equal to 1 and not 3 (the calculation above outputs 3 because it also counts the [Dop, Snack, Trans] combination, which is what I want to avoid).
Moreover, the order shouldn't matter: [Dop, Trans] is the same as [Trans, Dop].
Finally, I would like an ['all', 'all'] [row, column] entry indicating how many times an occurrence contains all elements.
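The three target values described above can be checked directly with boolean masks. This is only a sketch of the requirement for this specific example, not a general solution; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Rows that contain Snack and nothing else (only D0).
only_snack = int(((df['Snack'] == 1) & (df['Trans'] == 0) & (df['Dop'] == 0)).sum())

# Rows that contain exactly Dop and Trans, without Snack (only FF).
only_dop_trans = int(((df['Dop'] == 1) & (df['Trans'] == 1) & (df['Snack'] == 0)).sum())

# Rows that contain every element (AA and BB).
all_three = int((df == 1).all(axis=1).sum())

print(only_snack, only_dop_trans, all_three)
```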
My solution contains the following steps:
First, for each row of the df, get the list of columns whose value is equal to 1:
llist = []
for k, v in df.iterrows():
    llist.append(list(v[v == 1].index))
llist
[['Snack', 'Trans', 'Dop'],
['Trans'],
['Snack', 'Trans', 'Dop'],
['Snack'],
['Dop'],
['Trans', 'Dop']]
Then I duplicate the sublists that have only 1 element, and replace the sublists that have 3 elements with ['all', 'all']:
llist2 = llist.copy()
for i, l in enumerate(llist2):
    if len(l) == 1:
        llist2[i] = l + l
    if len(l) == 3:
        llist2[i] = ['all', 'all']  # this is to see how many triple elements I have in the list
llist2.append(['Dop', 'Trans'])  # this is to test that the order of the elements of the sublists doesn't matter
llist2
[['all', 'all'],
['Trans', 'Trans'],
['all', 'all'],
['Snack', 'Snack'],
['Dop', 'Dop'],
['Trans', 'Dop'],
['Dop', 'Trans']]
Next, I create a zero-filled dataframe with the index and columns of interest:
elements = ['Trans', 'Dop', 'Snack', 'all']
foo = pd.DataFrame(0, index=elements, columns=elements)  # zero-filled integer frame
foo
       Trans  Dop  Snack  all
Trans      0    0      0    0
Dop        0    0      0    0
Snack      0    0      0    0
all        0    0      0    0
Then I count how many times each combination appears in llist2:
from itertools import combinations_with_replacement

comb = combinations_with_replacement(elements, 2)
for l in comb:
    val = foo.loc[l[0], l[1]]
    foo.loc[l[0], l[1]] = val + llist2.count(list(l))
    # check if the reversed element exists as well, but do not double count the diagonal elements
    if len(set(l)) != 1 and list(reversed(l)) in llist2:
        val = foo.loc[l[0], l[1]]
        foo.loc[l[0], l[1]] = val + llist2.count(list(reversed(l)))
foo
       Trans  Dop  Snack  all
Trans      1    2      0    0
Dop        0    1      0    0
Snack      0    0      1    0
all        0    0      0    2
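As a possible alternative to the repeated `llist2.count` calls and the reversed-pair check, the counting could be sketched with `collections.Counter` keyed on `frozenset`, which ignores element order by construction. The list below is written out literally, and includes the extra ['Dop', 'Trans'] test entry, so that the snippet runs on its own; `counts` is an illustrative name:

```python
from collections import Counter

llist = [['Snack', 'Trans', 'Dop'], ['Trans'], ['Snack', 'Trans', 'Dop'],
         ['Snack'], ['Dop'], ['Trans', 'Dop'],
         ['Dop', 'Trans']]  # last entry mirrors the order test appended earlier

# frozenset keys make ['Trans', 'Dop'] and ['Dop', 'Trans'] the same combination.
counts = Counter(frozenset(cols) for cols in llist)

print(counts[frozenset(['Trans', 'Dop'])])           # pair, in either order
print(counts[frozenset(['Snack'])])                  # single element
print(counts[frozenset(['Snack', 'Trans', 'Dop'])])  # all elements
```

Each distinct combination is counted in a single pass over the list, so no diagonal special case is needed.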
The last step is to make foo symmetrical:
import numpy as np
foo = np.maximum( foo, foo.transpose() )
foo
       Trans  Dop  Snack  all
Trans      1    2      0    0
Dop        2    1      0    0
Snack      0    0      1    0
all        0    0      0    2
I am looking for a more efficient/faster solution that avoids all these loops.
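For reference, one loop-free direction could look like the sketch below. It rests on an assumption: rows are split by their number of active columns, so that single-element rows feed the diagonal, two-element rows feed the off-diagonal pairs, and rows with every column set feed ['all', 'all']. With more than three columns, rows holding between three and all-but-one elements would be ignored by this sketch. The names `k`, `singles`, `pairs`, `m` and `result` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

k = df.sum(axis=1)                    # number of active columns per row
singles = df[k == 1].to_numpy()       # rows with exactly one element
pairs = df[k == 2].to_numpy()         # rows with exactly two elements

m = pairs.T @ pairs                   # pair counts from two-element rows only
np.fill_diagonal(m, 0)                # discard the per-column totals on the diagonal
m = m + np.diag(singles.sum(axis=0))  # diagonal: rows containing only that column

labels = list(df.columns) + ['all']
result = (pd.DataFrame(m, index=df.columns, columns=df.columns)
            .reindex(index=labels, columns=labels, fill_value=0))
result.loc['all', 'all'] = int((k == df.shape[1]).sum())  # rows containing every element

print(result)
```

The result is symmetric by construction, so no `np.maximum` step is needed. Note that [Trans, Dop] comes out as 1 here rather than 2 as in foo, because this sketch works from df directly and does not include the extra ['Dop', 'Trans'] test entry.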