0

I have a dataset like this

id january february march april
1 scheduled done null done
2 scheduled scheduled done
3 ongoing canceled scheduled

I desire to transform this dataset in a matrix like this in output, where each cell are the occurrences of the exact intersection (keeping Null Values).

event january february march april
scheduled 2 1 1 0
done 0 1 1 1
ongoing 1 0 0 0
null 0 0 0 2
wjandrea
  • 28,235
  • 9
  • 60
  • 81
Jresearcher
  • 297
  • 3
  • 13
  • 1
    I don't quite understand... why there isn't `canceled` in the `event` column? Also, in `march` there's `null` but in the output you have `0`... – Andrej Kesely May 30 '23 at 22:15
  • Given: `df = pd.read_html('https://stackoverflow.com/q/76368967/7758804')[0]` Then: `dfm = df.iloc[:, 1:].melt()` and `dfm.value_counts(dropna=False).unstack(level=0)`. `pd.crosstab(..., dropna=False)` still doesn't count the `NaN` / `null` values. See [code and plot](https://i.stack.imgur.com/FYZy0.png). You can drop whichever rows you want from there. – Trenton McKinney May 30 '23 at 23:44
  • `dfm.value_counts(dropna=False).unstack(level=0).iloc[[3, 1, 2, 4], [2, 1, 3, 0]].fillna(0).astype(int)` - `null`, and `empty` / `NaN` are recognized as `NaN`. – Trenton McKinney May 30 '23 at 23:47
  • More info following Andrej's comment: [How to make good reproducible pandas examples](/q/20109391/4518341), as well as [mre] in general. – wjandrea May 31 '23 at 00:18
  • What have you tried, and what do you need help with exactly? There's this existing question that covers a bunch of relevant techniques, so you might find your answer there: [How can I pivot a dataframe?](/q/47152691/4518341) – wjandrea May 31 '23 at 00:19

1 Answers1

1

What you want cannot be done with pivot because your dataset is not in the standard structure. According to your dataset provide, for each month column, you use groupby function to count for the number of occurrences.

Sample data:

df = pd.DataFrame({'id':[1,2,3],
                   'jan':['schedule', 'schedule', 'ongoing'],
                   'feb':['done', 'schedule', 'canceled'],
                   'mar':['null', 'done', 'schedule'],
                   'apr':['done', np.NaN, np.NaN]})

Output:

temdf = pd.DataFrame()

for i in ['jan', 'feb', 'mar', 'apr']:
    tem = df.groupby(f'{i}').size().reset_index().rename(columns={f'{i}':'event', 0:f'{i}'}).set_index('event')
    temdf = pd.concat([temdf, tem], axis=1)

temdf.reset_index(inplace=True)

temdf


      event  jan  feb  mar  apr
0   ongoing  1.0  NaN  NaN  NaN
1  schedule  2.0  1.0  1.0  NaN
2  canceled  NaN  1.0  NaN  NaN
3      done  NaN  1.0  1.0  1.0
4      null  NaN  NaN  1.0  NaN
PTQuoc
  • 938
  • 4
  • 13