3

I'm trying to get the count (where all row values are 1) of each possible combination between the eight columns of a dataframe. Basically I need to understand how many times different overlaps exist.

I've tried to use itertools.product to get all the combinations, but it doesn't seem to work.

import pandas as pd
import numpy as np
import itertools

df = pd.read_excel('filename.xlsx')

df.head(15)

    a   b   c   d   e   f   g   h
0   1   0   0   0   0   1   0   0
1   1   0   0   0   0   0   0   0
2   1   0   1   1   1   1   1   1
3   1   0   1   1   0   1   1   1
4   1   0   0   0   0   0   0   0
5   0   1   0   0   1   1   1   1
6   1   1   0   0   1   1   1   1
7   1   1   1   1   1   1   1   1
8   1   1   0   0   1   1   0   0
9   1   1   1   0   1   0   1   0
10  1   1   1   0   1   1   0   0
11  1   0   0   0   0   1   0   0
12  1   1   1   1   1   1   1   1
13  1   1   1   1   1   1   1   1
14  0   1   1   1   1   1   1   0


print(list(itertools.product(new_df.columns)))

The expected output would be a dataframe with the count (n) of rows for each of valid combinations (where values in the row are all 1).

For example:

    a   b
0   1   0   
1   1   0   
2   1   0   
3   1   0   
4   1   0   
5   0   1   
6   1   1   
7   1   1   
8   1   1   
9   1   1   
10  1   1   
11  1   0   
12  1   1   
13  1   1   
14  0   1

Would give

combination   count

a              12
a_b             7
b               9

Note that the output would need to contain all the combinations possible between a and h, not just pairwise

Dan
  • 431
  • 6
  • 20

4 Answers4

6

Powerset Combinations

Use the powerset recipe with,

s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum() 
    for c in map(list, filter(None, powerset(df)))
})

a                  13
b                   9
c                   8
d                   6
e                  10
f                  12
g                   9
h                   7
a_b                 7
...

Pairwise Combinations

This is a special case, and can be vectorized.

from itertools import combinations

u = df.T.dot(df)   
pd.DataFrame({
    'combination': [*map('_'.join, combinations(df, 2))], 
    # pandas < 0.24
    # 'count': u.values[np.triu_indices_from(u, k=1)]
    # pandas >= 0.24
    'count': u.to_numpy()[np.triu_indices_from(u, k=1)]
})

You can use dot, then extract the upper triangular matrix values:

  combination  count
0         a_b      7
1         a_c      7
2         a_d      5
3         a_e      8
4         a_f     10
5         a_g      7
6         a_h      6
7         b_c      6
8         b_d      4
9         b_e      9
Community
  • 1
  • 1
cs95
  • 379,657
  • 97
  • 704
  • 746
  • thanks for this - although I would like to have all combinations, not just pairwise so for example a_b_c_d_e_f_g_h or a_c_e_g – Dan Jul 08 '19 at 21:03
  • You can increase the combinations count for that. Currently, that is 2, you can see that in `combinations(df, 2)`, change it to maybe 8 if you want from a to h. Excellent answer btw! – amanb Jul 08 '19 at 21:05
  • @amanb Thanks but that only solves the problem of getting the indices, not the corresponding counts. The latter is a much harder problem that was not made clear from OP's post, and their expected output (well, their initial revision) was also misleading (it only shows the expected output for a_b, leading one to believe they only wanted pairwise counts). – cs95 Jul 08 '19 at 21:11
  • @Dan Check the edit, and please be more clear with your requirements in future. – cs95 Jul 08 '19 at 21:15
  • 2
    oOo hand't thought of using `.min`. Nice one – ALollz Jul 08 '19 at 21:21
  • @ALollz It is deceptively simple, but a lot of the calculations here are redundant. Perhaps there is a better way to do this, but I found this to be the most straightforward. – cs95 Jul 08 '19 at 21:32
2

As you happen to have 8 columns, np.packbits together with np.bincount is rather convenient here:

import numpy as np
import pandas as pd

# make large example
ncol, nrow = 8, 1_000_000
df = pd.DataFrame(np.random.randint(0,2,(nrow,ncol)), columns=list("abcdefgh"))

from time import time
T = [time()]
# encode as binary numbers and count
counts = np.bincount(np.packbits(df.values.astype(np.uint8)),None,256)

# find sets in other sets
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]

# and sum
ccounts = (counts * contained).sum(1)

# if there are empty bins, remove them
nz = np.where(ccounts)[0].astype(np.uint8)

# helper to build bin labels 
a2h = np.array(list("abcdefgh"))

# put labels to counts
result = pd.Series(ccounts[nz], index = ["_".join((*a2h[np.unpackbits(i).view(bool)],)) for i in nz])



from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

T.append(time())
s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum() 
    for c in map(list, filter(None, powerset(df)))
})
T.append(time())

print("packbits {:.3f} powerset {:.3f}".format(*np.diff(T)))
print("results equal", (result.sort_index()[1:]==s.sort_index()).all())

This gives the same result as the powerset approach but literally 1000x faster:

packbits 0.016 powerset 21.974
results equal True
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
0

If you have just values of 1 and 0, you could do:

df= pd.DataFrame({
'a': [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
'b': [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
'c': [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
'd': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})

(df.a * df.b).sum()

This results in 4.

To get all combinations you can use combinations from itertools:

from itertools import combinations

analyze=[(col,) for col in df.columns]
analyze.extend(combinations(df.columns, 2))
for cols in analyze:
    num_ser= None
    for col in cols:
        if num_ser is None:
            num_ser= df[col]
        else:
            num_ser*= df[col]
    num= num_ser.sum()
    print(f'{cols} contains {num}')

This results in:

('a',) contains 4
('b',) contains 7
('c',) contains 11
('d',) contains 23
('a', 'b') contains 4
('a', 'c') contains 4
('a', 'd') contains 4
('b', 'c') contains 7
('b', 'd') contains 7
('c', 'd') contains 11
jottbe
  • 4,228
  • 1
  • 15
  • 31
0

Cooccurence matrix is all you need:

Let's construct an example first:

import numpy as np
import pandas as pd

mat = np.zeros((5,5))
mat[0,0] = 1
mat[0,1] = 1
mat[1,0] = 1
mat[2,1] = 1
mat[3,3] = 1
mat[3,4] = 1
mat[2,4] = 1
cols = ['a','b','c','d','e']
df = pd.DataFrame(mat,columns=cols)
print(df)

     a    b    c    d    e
0  1.0  1.0  0.0  0.0  0.0
1  1.0  0.0  0.0  0.0  0.0
2  0.0  1.0  0.0  0.0  1.0
3  0.0  0.0  0.0  1.0  1.0
4  0.0  0.0  0.0  0.0  0.0

Now we construct the cooccurence matrix:

# construct the cooccurence matrix:
co_df = df.T.dot(df)
print(co_df)

     a    b    c    d    e
a  2.0  1.0  0.0  0.0  0.0
b  1.0  2.0  0.0  0.0  1.0
c  0.0  0.0  0.0  0.0  0.0
d  0.0  0.0  0.0  1.0  1.0
e  0.0  1.0  0.0  1.0  2.0

Finally the result you need:

result = {}

for c1 in cols:
    for c2 in cols:
        if c1 == c2:
            if c1 not in result:
                result[c1] = co_df[c1][c2]
        else:

            if '_'.join([c1,c2]) not in result:
                result['_'.join([c1,c2])] = co_df[c1][c2]


print(result)



{'a': 2.0, 'a_b': 1.0, 'a_c': 0.0, 'a_d': 0.0, 'a_e': 0.0, 'b_a': 1.0, 'b': 2.0, 'b_c': 0.0, 'b_d': 0.0, 'b_e': 1.0, 'c_a': 0.0, 'c_b': 0.0, 'c': 0.0, 'c_d': 0.0, 'c_e': 0.0, 'd_a': 0.0, 'd_b': 0.0, 'd_c': 0.0, 'd': 1.0, 'd_e': 1.0, 'e_a': 0.0, 'e_b': 1.0, 'e_c': 0.0, 'e_d': 1.0, 'e': 2.0}
snowneji
  • 1,086
  • 1
  • 11
  • 25