
I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                    'Snack' : [1, 0, 1, 1, 0, 0],
                    'Trans' : [1, 1, 1, 0, 0, 1],
                    'Dop' : [1, 0, 1, 0, 1, 1]}).set_index('TFD')
df

    Snack   Trans   Dop
TFD         
AA  1   1   1
SL  0   1   0
BB  1   1   1
D0  1   0   0
Dk  0   0   1
FF  0   1   1

By using this I can calculate the following co-occurrence matrix:

df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)
coocc

    Snack   Trans   Dop
Snack   3   2   2
Trans   2   4   3
Dop     2   3   4
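
To make the overlap concrete, here is a small check of which rows feed the [Dop, Trans] cell. It only restates the dot product for one pair; the names `both` and `dop_trans` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Rows where both Dop and Trans are 1, whatever the value of Snack is.
both = df[(df['Dop'] == 1) & (df['Trans'] == 1)]
dop_trans = len(both)

print(both.index.tolist())  # AA and BB also have Snack set, FF does not
print(dop_trans)            # same value as coocc.loc['Dop', 'Trans']
```

Two of those rows (AA, BB) also have Snack set, which is why the plain dot product counts more than the rows that contain only Dop and Trans.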

However, I want the occurrences not to overlap.

What I mean is this:

  • In the original df there is only 1 TFD (D0) that has only Snack, so the [Snack, Snack] value in the coocc table should be 1.

  • Likewise, [Dop, Trans] should be 1, not 3. (The calculation above gives 3 because it also counts the rows with the [Dop, Snack, Trans] combination, which is what I want to avoid.)

  • The order shouldn't matter: [Dop, Trans] is the same as [Trans, Dop].

  • There should be an ['all', 'all'] [row, column] entry that indicates how many rows contain all elements.
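
One way to read these requirements is "count each row by its exact combination of active columns". A minimal sketch of that reading, assuming the df from above; `pattern_counts` is a name introduced here, and this only produces the counts, not yet the square table:

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

# Group identical rows together; the index is the (Snack, Trans, Dop) pattern.
pattern_counts = df.groupby(list(df.columns)).size()
print(pattern_counts)

# Exactly Trans and Dop, without Snack:
print(pattern_counts.loc[(0, 1, 1)])
# All three elements:
print(pattern_counts.loc[(1, 1, 1)])
```

Each pattern with one active column maps to a diagonal cell, each pattern with two active columns maps to an off-diagonal cell, and the all-ones pattern maps to ['all', 'all'].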

My solution contains the following steps:

First, for each row of the df get the list of columns for which the column has value equal to 1:

llist = []
for k,v in df.iterrows():
    llist.append((list(v[v==1].index)))
llist

[['Snack', 'Trans', 'Dop'],
 ['Trans'],
 ['Snack', 'Trans', 'Dop'],
 ['Snack'],
 ['Dop'],
 ['Trans', 'Dop']]
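
As a side note, this first step can probably be done without iterrows. A sketch using a boolean NumPy array to index the column names, assuming every cell is 0 or 1:

```python
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

cols = df.columns.to_numpy()
# Each row of the boolean array selects the names of the columns that are 1.
llist = [cols[row].tolist() for row in df.to_numpy().astype(bool)]
print(llist)
```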

Then I duplicate the sublists that have only 1 element, and replace the sublists that contain all 3 elements with ['all', 'all']:

llist2 = llist.copy()
for i,l in enumerate(llist2):
    if len(l) == 1:
        llist2[i] = l + l
    if len(l) == 3:
        llist2[i] = ['all', 'all'] # this is to see how many triple elements I have in the list
llist2.append(['Dop', 'Trans']) # This is to test that the order of the elements of the sublists doesnt matter
llist2

[['all', 'all'],
 ['Trans', 'Trans'],
 ['all', 'all'],
 ['Snack', 'Snack'],
 ['Dop', 'Dop'],
 ['Trans', 'Dop'],
 ['Dop', 'Trans']]

Later I create an empty dataframe with the indexes and columns of interest:

elements = ['Trans', 'Dop', 'Snack', 'all']
foo = pd.DataFrame(0, index=elements, columns=elements)  # zero-filled count matrix
foo

    Trans   Dop Snack   all
Trans   0   0   0   0
Dop     0   0   0   0
Snack   0   0   0   0
all     0   0   0   0

Then I check and count which combinations are included in llist2:

from itertools import combinations_with_replacement

comb = combinations_with_replacement(elements, 2)
for l in comb:
    # count the sublists that match this pair in the given order
    foo.loc[l[0], l[1]] += llist2.count(list(l))
    # also count the reversed pair, but do not double count the diagonal elements
    if l[0] != l[1]:
        foo.loc[l[0], l[1]] += llist2.count(list(reversed(l)))
foo

    Trans   Dop Snack   all
Trans   1   2   0   0
Dop     0   1   0   0
Snack   0   0   1   0
all     0   0   0   2
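
The order-insensitive counting could also be sketched with collections.Counter by sorting each pair before counting, so that ['Dop', 'Trans'] and ['Trans', 'Dop'] end up under the same key. The llist2 below is copied from the output above (including the manually appended test pair); `pair_counts` is a name made up for this sketch.

```python
from collections import Counter

llist2 = [['all', 'all'],
          ['Trans', 'Trans'],
          ['all', 'all'],
          ['Snack', 'Snack'],
          ['Dop', 'Dop'],
          ['Trans', 'Dop'],
          ['Dop', 'Trans']]

# Sorting each pair makes the key independent of the element order.
pair_counts = Counter(tuple(sorted(pair)) for pair in llist2)
print(pair_counts)
```

Here the key ('Dop', 'Trans') collects both orderings, so no separate check for the reversed pair is needed.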

The last step is to make foo symmetric:

import numpy as np

foo = np.maximum( foo, foo.transpose() )
foo

    Trans   Dop Snack   all
Trans   1   2   0   0
Dop     2   1   0   0
Snack   0   0   1   0
all     0   0   0   2
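
The symmetrizing step works because the counts so far sit only on the diagonal and above it, so the element-wise maximum with the transpose just mirrors them. A standalone sketch with the values from above typed in by hand (`upper` and `sym` are names introduced here):

```python
import numpy as np
import pandas as pd

elements = ['Trans', 'Dop', 'Snack', 'all']
upper = pd.DataFrame([[1, 2, 0, 0],
                      [0, 1, 0, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 2]], index=elements, columns=elements)

# Element-wise maximum of the matrix and its transpose mirrors the upper triangle.
sym = np.maximum(upper, upper.T)
print(sym)
```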

I am looking for a more efficient/faster solution that avoids all these loops.

quant
  • You say "the [Dop, Trans] should be equal to 1 and not equal to 3", but in your solution it is 2? – Atanas Atanasov Dec 01 '22 at 13:51
  • Also, do I understand correctly that as long as Snack = 1, we can ignore what is in the "Trans" and "Dop" columns if we want to calculate [Trans, Dop]? – Atanas Atanasov Dec 01 '22 at 14:02
  • @AtanasAtanasov [Dop, Trans] is equal to 2 in the end because I manually add a test line, `llist2.append(['Dop', 'Trans']) # This is to test that the order of the elements of the sublists doesnt matter`, in order to show that the order doesn't matter – quant Dec 02 '22 at 10:45
  • @AtanasAtanasov I do not understand your second question – quant Dec 02 '22 at 10:46

1 Answer

I managed to shrink it to one "for" loop. I am using "any" and "all" in combination with a boolean mask.

import pandas as pd
import itertools


df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop':   [1, 0, 1, 0, 1, 1]}).set_index('TFD')

df["all"] = 0  # add an artificial column so that the result contains "all"
list_of_columns = list(df.columns)
my_result_list = []  # empty list where we put the results
comb = itertools.combinations_with_replacement(list_of_columns, 2)
for item in comb:
    temp_list = list_of_columns[:]  # temp_list holds the columns that must all be 0
    if item[0] == item[1]:
        temp_list.remove(item[0])
        my_col_list = [item[0]]  # my_col_list holds the columns whose occurrence we count
    else:
        temp_list.remove(item[0])
        temp_list.remove(item[1])
        my_col_list = [item[0], item[1]]

    mask = df.loc[:, temp_list].any(axis=1)  # True for rows where any other column is set; these are excluded
    count = df.loc[~mask, my_col_list].all(axis=1).sum()  # number of remaining rows with every counted column set
    my_result_list.append([item[0], item[1], count])  # occurrence info recorded in the list
    my_result_list.append([item[1], item[0], count])  # also stored reversed so we get a square form in the end

result = pd.DataFrame(my_result_list).drop_duplicates().pivot(index=1, columns=0, values=2)  # build the DataFrame in square form
list_of_columns.remove("all")
result.loc["all", "all"] = df.loc[:, list_of_columns].all(axis=1).sum()  # fill in the all/all count
print(result)
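
For comparison, here is a loop-free sketch of the same idea. It is not the code above but a variant that splits the rows by how many columns are set. It assumes 0/1 data and three columns as in the question: with more columns, rows that have more than two but not all columns set are simply not counted, and with fewer than three columns the pair and "all" counts would coincide. All variable names are made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TFD': ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'],
                   'Snack': [1, 0, 1, 1, 0, 0],
                   'Trans': [1, 1, 1, 0, 0, 1],
                   'Dop': [1, 0, 1, 0, 1, 1]}).set_index('TFD')

cols = list(df.columns)
a = df.to_numpy().astype(bool)           # rows x columns
n_active = a.sum(axis=1)                 # number of columns set in each row

singles = a[n_active == 1].astype(int)   # rows with exactly one column set
pairs = a[n_active == 2].astype(int)     # rows with exactly two columns set

mat = pairs.T @ pairs                    # co-occurrence restricted to exact pairs
np.fill_diagonal(mat, 0)                 # the diagonal is reserved for the singles
mat = mat + np.diag(singles.sum(axis=0))

result = pd.DataFrame(0, index=cols + ['all'], columns=cols + ['all'])
result.loc[cols, cols] = mat
result.loc['all', 'all'] = int((n_active == len(cols)).sum())
print(result)
```

The matrix is symmetric by construction, so no mirroring step is needed afterwards.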
Atanas Atanasov