Problem
I have a pandas dataframe, and for every pair of unique entries in it I need to count how many rows contain both entries.
Related but different posts
- Co-occurrence Matrix from list of words in Python: Similar question to mine, but does not start with a dataframe. Most answers use iterations. I hope a better solution exists in Pandas.
- Constructing a co-occurrence matrix in python pandas: This already starts with a dataframe whose body contains only 0s and 1s (presumably indicators for the actual values) rather than the actual values themselves.
- Convert Two column data frame to occurrence matrix in pandas: This post assumes there are only two columns, which is too restrictive for the case discussed here.
Reproducible Setup
import pandas as pd
import numpy as np
The dataframe:
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])
i.e.:
+----+-----+-----+-----+-----+
| | a | b | c | d |
|----+-----+-----+-----+-----|
| 0 | A | B | C | B |
| 1 | A | C | A | D |
| 2 | B | B | C | B |
| 3 | B | B | A | A |
+----+-----+-----+-----+-----+
(Printed using this.)
What I have tried
I tried using the code from an answer, substituting these variables:
document = [list(each) for each in df.values]
names = list(np.unique(df.values))
It gave the wrong results:
A B C D
A 4 6 3 2
B 6 10 5 0
C 3 5 0 1
D 2 0 1 0
It is also based on iteration, so I would hope for a better solution.
Expected Output
+----+-----+-----+-----+-----+
| | A | B | C | D |
|----+-----+-----+-----+-----|
| A | nan | 2 | 2 | 1 |
| B | 2 | nan | 2 | 0 |
| C | 2 | 2 | nan | 1 |
| D | 1 | 0 | 1 | nan |
+----+-----+-----+-----+-----+
There are 2 rows where A & B both appear (rows 0 and 3), so the value in the cell at row A, column B is 2.
There are 2 rows where A & C both appear (rows 0 and 1), so the value in the cell at row A, column C is 2.
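To make the definition above concrete, here is a minimal sketch of one way the counts could be computed without an explicit Python loop. It is not necessarily the idiomatic pandas route I am asking about. It assumes every entry can be treated as a string, and the names `values`, `labels`, `indicator` and `cooc` are my own, chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

# Assumption: all entries are (or can be cast to) strings.
values = df.to_numpy().astype(str)

# Sorted unique entries; these become the row and column labels.
labels = np.unique(values)

# indicator[i, k] is 1 if labels[k] appears anywhere in row i of df.
# Broadcasting compares every cell against every label, then collapses
# the column axis so repeated values in a row are counted only once.
indicator = (values[:, :, None] == labels).any(axis=1).astype(int)

# Entry (j, k) of indicator.T @ indicator is the number of rows that
# contain both labels[j] and labels[k].
counts = (indicator.T @ indicator).astype(float)

# The expected output leaves the diagonal as NaN.
np.fill_diagonal(counts, np.nan)

cooc = pd.DataFrame(counts, index=labels, columns=labels)
print(cooc)
```

The intermediate comparison has shape (rows, columns, unique entries), so this sketch trades memory for the absence of loops and may not suit very large frames.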
Question
How can I get this row-wise co-occurrence matrix easily in pandas? It would be great if I didn't have to loop through the values.
(pandas.Categorical might be of some use, but I haven't managed to make it work yet.)