
I have a dataset comprising values of 1 or 0 which identify whether a given mineral (M) is present, or not, within a sample (S). Example below, but the dataset itself includes about 100 minerals across 160 samples.

import numpy as np
import pandas as pd

data = np.array([['S1', '1', '1', '0', '0'],
                 ['S2', '0', '1', '0', '1'],
                 ['S3', '1', '1', '1', '1'],
                 ['S4', '0', '0', '0', '1']])
                   

minerals = ['Sample', 'M1', 'M2', 'M3', 'M4']

df = pd.DataFrame(data, columns=minerals).set_index('Sample')

co_occurrence = pd.DataFrame(columns=minerals[1:], index=minerals[1:])

For every pair of minerals, I need to record how often they co-occur in a separate dataframe called co_occurrence. That is, I need to compare every pair of columns in df, count the samples in which both minerals are present (1), and enter that total in co_occurrence.

In the example given, the value for the pair M1:M2 in co_occurrence should be 2, as they occur together twice in df (in samples S1 and S3).

How do I go about doing this?

geolguy

1 Answer


You can convert the values to integers, then for each pair of columns take the bitwise AND and sum the result:

from itertools import combinations

df = df.astype(int)

co_occurrence = (pd.Series({(c1,c2): (df[c1]&df[c2]).sum()
                            for c1,c2 in combinations(df.columns, 2)})
                   .unstack(-1)
                )

Output (each pair is counted once, so the unfilled cells are NaN and the counts are shown as floats):

     M2   M3   M4
M1  2.0  1.0  1.0
M2  NaN  1.0  2.0
M3  NaN  NaN  1.0
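
If a symmetric table that can be read in either direction is preferred, one common alternative is the matrix product of the transposed frame with itself. The following is a minimal sketch under that assumption, not necessarily the approach of any linked answer; the names `full` and `pairs_only` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
data = np.array([['S1', '1', '1', '0', '0'],
                 ['S2', '0', '1', '0', '1'],
                 ['S3', '1', '1', '1', '1'],
                 ['S4', '0', '0', '0', '1']])
minerals = ['Sample', 'M1', 'M2', 'M3', 'M4']
df = pd.DataFrame(data, columns=minerals).set_index('Sample').astype(int)

# Entry (i, j) of df.T @ df counts the samples in which both
# mineral i and mineral j are present
full = df.T.dot(df)

# The diagonal holds each mineral's own occurrence count;
# mask it with NaN if only pairwise counts are wanted
pairs_only = full.mask(np.eye(len(full), dtype=bool))

print(full)
print(pairs_only)
```

Here `full.loc['M1', 'M2']` and `full.loc['M2', 'M1']` hold the same total, so the table can be read in either direction. Masking the diagonal turns the counts into floats, as in the output above.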
jezrael
mozway
  • It seems this is not a `co_occurrence` matrix; I am still thinking about it. – jezrael Sep 13 '21 at 07:37
  • I added a comment under the question. – jezrael Sep 13 '21 at 07:44
  • @geolguy Is this what you needed? There is another solution, linked as a potential duplicate, which gives a different format. – mozway Sep 13 '21 at 07:49
  • @mozway Yes this works well for me, thank you. I see the other solution ultimately gives you the same totals but readable pairwise in either direction, which was not essential here. – geolguy Sep 13 '21 at 08:24