Crosstab (contingency table, or similar)

Question

I have a dataset like this

id	january	february	march	april
1	scheduled	done	null	done
2	scheduled	scheduled	done
3	ongoing	canceled	scheduled

I desire to transform this dataset in a matrix like this in output, where each cell are the occurrences of the exact intersection (keeping Null Values).

event	january	february	march	april
scheduled	2	1	1	0
done	0	1	1	1
ongoing	1	0	0	0
null	0	0	0	2

I don't quite understand... why there isn't `canceled` in the `event` column? Also, in `march` there's `null` but in the output you have `0`... — Andrej Kesely, May 30 '23 at 22:15
Given: `df = pd.read_html('https://stackoverflow.com/q/76368967/7758804')[0]` Then: `dfm = df.iloc[:, 1:].melt()` and `dfm.value_counts(dropna=False).unstack(level=0)`. `pd.crosstab(..., dropna=False)` still doesn't count the `NaN` / `null` values. See [code and plot](https://i.stack.imgur.com/FYZy0.png). You can drop whichever rows you want from there. — Trenton McKinney, May 30 '23 at 23:44
`dfm.value_counts(dropna=False).unstack(level=0).iloc[[3, 1, 2, 4], [2, 1, 3, 0]].fillna(0).astype(int)` - `null`, and `empty` / `NaN` are recognized as `NaN`. — Trenton McKinney, May 30 '23 at 23:47
More info following Andrej's comment: [How to make good reproducible pandas examples](/q/20109391/4518341), as well as [mre] in general. — wjandrea, May 31 '23 at 00:18
What have you tried, and what do you need help with exactly? There's this existing question that covers a bunch of relevant techniques, so you might find your answer there: [How can I pivot a dataframe?](/q/47152691/4518341) — wjandrea, May 31 '23 at 00:19

score 1 · Accepted Answer · answered May 31 '23 at 09:41

What you want cannot be done with pivot because your dataset is not in the standard structure. According to your dataset provide, for each month column, you use groupby function to count for the number of occurrences.

Sample data:

df = pd.DataFrame({'id':[1,2,3],
                   'jan':['schedule', 'schedule', 'ongoing'],
                   'feb':['done', 'schedule', 'canceled'],
                   'mar':['null', 'done', 'schedule'],
                   'apr':['done', np.NaN, np.NaN]})

Output:

temdf = pd.DataFrame()

for i in ['jan', 'feb', 'mar', 'apr']:
    tem = df.groupby(f'{i}').size().reset_index().rename(columns={f'{i}':'event', 0:f'{i}'}).set_index('event')
    temdf = pd.concat([temdf, tem], axis=1)

temdf.reset_index(inplace=True)

temdf


      event  jan  feb  mar  apr
0   ongoing  1.0  NaN  NaN  NaN
1  schedule  2.0  1.0  1.0  NaN
2  canceled  NaN  1.0  NaN  NaN
3      done  NaN  1.0  1.0  1.0
4      null  NaN  NaN  1.0  NaN

Crosstab (contingency table, or similar)

1 Answers1