Question
I am using Pandas on Python 3.7.7. I would like to compute the mutual information between categorical values of a variable x grouped by another variable's values y. My data looks like the following table:
+-----+-----+
| x | y |
+-----+-----+
| x_1 | y_1 |
| x_2 | y_1 |
| x_3 | y_1 |
| x_1 | y_2 |
| x_2 | y_2 |
| x_4 | y_3 |
| x_6 | y_3 |
| x_9 | y_3 |
| x_1 | y_4 |
| ... | ... |
+-----+-----+
I would like to have a data structure (a pandas MultiIndex Series/DataFrame, a NumPy matrix, or anything suitable) which stores the number of co-occurrences of each (x_i, x_j) pair given a certain y_k value. This would make it easy, for example, to compute PMI:
+-----+-----+--------+-------+
| x_i | x_j | cooc | pmi |
+-----+-----+--------+-------+
| x_1 | x_2 | | |
| x_1 | x_3 | | |
| x_1 | x_4 | | |
| x_1 | x_5 | | |
| ... | ... | ... | ... |
+-----+-----+--------+-------+
Is there a suitable, memory-efficient way to do this?
SIDE NOTE: I am using quite big data (40k distinct x values and 8k distinct y values, with a total of 300k (x, y) entries), so a memory-friendly and optimized approach would be great (maybe relying on third-party libraries such as Dask).
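To make the target structure concrete, here is a minimal sketch of one direction that might fit these sizes: a sparse x-by-y incidence matrix multiplied by its transpose. It assumes SciPy is available and that an (x, y) pair counts as present/absent (repeated rows are dropped), which may not match every definition of co-occurrence. The sample rows are the ones visible in the table above; the names `incidence`, `cooc` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Rows visible in the example table (the "..." row is left out)
df = pd.DataFrame({
    "x": ["x_1", "x_2", "x_3", "x_1", "x_2", "x_4", "x_6", "x_9", "x_1"],
    "y": ["y_1", "y_1", "y_1", "y_2", "y_2", "y_3", "y_3", "y_3", "y_4"],
})

# Assumption: each (x, y) pair is either present or absent
pairs = df.drop_duplicates()
x_codes, x_labels = pd.factorize(pairs["x"])
y_codes, y_labels = pd.factorize(pairs["y"])

# Sparse binary matrix: one row per distinct x, one column per distinct y
incidence = sparse.csr_matrix(
    (np.ones(len(pairs), dtype=np.int64), (x_codes, y_codes)),
    shape=(len(x_labels), len(y_labels)),
)

# Entry (i, j) of the product is the number of y values shared by x_i and x_j;
# keep only the strict upper triangle so each unordered pair appears once
cooc = sparse.triu(incidence @ incidence.T, k=1).tocoo()

result = pd.DataFrame({
    "x_i": x_labels[cooc.row],
    "x_j": x_labels[cooc.col],
    "cooc": cooc.data,
})
print(result)
```

Only pairs that actually share at least one y value are stored, so the result stays far smaller than the full 40k x 40k grid as long as most pairs never co-occur.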
Update
Non-optimized solution
I came up with a solution using pd.crosstab. I provide here a small example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('xy'))
"""
df:
+-----+-----+
| x | y |
+-----+-----+
| 4 | 99 |
| 1 | 39 |
| 39 | 56 |
| .. | .. |
| 59 | 20 |
| 82 | 57 |
+-----+-----+
100 rows × 2 columns
"""
# Compute cross tabulation:
crosstab = pd.crosstab(df["x"], df["y"])
"""
crosstab:
+------+-----+-----+-----+-----+
| y | 0 | 2 | 3 | ... |
| x +-----+-----+-----+-----+
| 1 | 0 | 0 | 0 | ... |
| 2 | 0 | 0 | 0 | ... |
| ... | ... | ... | ... | ... |
+------+-----+-----+-----+-----+
62 rows × 69 columns
"""
# Initialize a pandas MultiIndex Series storing PMI values
import itertools
x_pairs = list(itertools.combinations(crosstab.index, 2))
pmi = pd.Series(0, index = pd.MultiIndex.from_tuples(x_pairs))
"""
pmi:
+-------------+-----+
| index | val |
+------+------| |
| x_i | x_j | |
+------+------+-----+
| 1 | 2 | 0 |
| | 4 | 0 |
| ... | ... | ... |
| 95 | 98 | 0 |
| | 99 | 0 |
| 96 | 98 | 0 |
+------+------+-----+
Length: 1891, dtype: int64
"""
Then, the loop I use to fill the Series is structured as follows:
for x1, x2 in x_pairs:
    pmi.loc[x1, x2] = crosstab.loc[[x1, x2]].min().sum() / (crosstab.loc[x1].sum() * crosstab.loc[x2].sum())
This is not an optimal solution: it performs poorly even on small use cases.
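For comparison, the same formula as the loop above can be sketched without a Python-level loop using NumPy broadcasting. This is only a sketch for the toy example: it builds a dense intermediate array of shape (n_x, n_x, n_y), so it would not fit in memory with 40k distinct x values. The names `counts`, `cooc`, `totals` and `score` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the toy example is reproducible
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list("xy"))
crosstab = pd.crosstab(df["x"], df["y"])

counts = crosstab.to_numpy()

# cooc[i, j] = sum over y of min(count(x_i, y), count(x_j, y)),
# i.e. the same quantity as crosstab.loc[[x1, x2]].min().sum()
cooc = np.minimum(counts[:, None, :], counts[None, :, :]).sum(axis=-1)

# Row totals, matching crosstab.loc[x].sum()
totals = counts.sum(axis=1)
score = cooc / np.outer(totals, totals)

# Keep each unordered pair once, in the same order as itertools.combinations
i, j = np.triu_indices(len(crosstab.index), k=1)
pmi = pd.Series(
    score[i, j],
    index=pd.MultiIndex.from_arrays(
        [crosstab.index[i], crosstab.index[j]], names=["x_i", "x_j"]
    ),
)
print(pmi.head())
```

This removes the per-pair `.loc` lookups, which are likely the main cost of the loop, but it does nothing for memory, so it is not a substitute for a sparse approach on the real data.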