
I have time-indexed data:

import numpy as np
import pandas as pd
from datetime import date

df2 = pd.DataFrame({'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3), date(2012, 1, 5)]),
                    'a': pd.Series(['A', 'B', 'C']),
                    'b': pd.Series(['C', 'E', 'C']),
                    'c': pd.Series(['E', 'F', 'A']),
                    'd': pd.Series(['B', np.nan, 'E'])})
df2 = df2.set_index('day')
df2

            a   b   c   d
day         
2012-01-01  A   C   E   B 
2012-01-03  B   E   F   NaN
2012-01-05  C   C   A   E

What is the best way to reshape the table to get the frequency of each unique value on each day?

For example, A occurs once on 1/01 and once on 1/05.

So the desired result would be:

            A   B   C   D   E   F  NaN
day         
2012-01-01  1   1   1   0   1   0   0
2012-01-03  0   1   0   0   1   1   1
2012-01-05  1   0   2   0   1   0   0

Many thanks!

D. Wei
  • Does this answer your question? [Remove NaN values from pandas dataframe and reshape table](https://stackoverflow.com/questions/46235097/remove-nan-values-from-pandas-dataframe-and-reshape-table) – APhillips Jan 20 '20 at 22:29
  • @APhillips Thanks for the suggestion, but that doesn't seem to be quite the same problem. – D. Wei Jan 20 '20 at 22:58

3 Answers

df3 = df2.reset_index().melt('day').pivot_table(index='day', columns='value', aggfunc=np.size, fill_value=0)
df3.columns = df3.columns.droplevel(0)
df3.columns.name = None


            A  B  C  E  F
day
2012-01-01  1  1  1  1  0
2012-01-03  0  1  0  1  1
2012-01-05  1  0  2  1  0

You can do a melt (with a reset_index first, since day is the index), followed by a pivot, to get what I believe you're looking for. Note that the NaN values are dropped during the pivot, so no NaN column appears.
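
For reference, this is what the intermediate melted frame looks like (df_long is just an illustrative name):

df_long = df2.reset_index().melt('day')   # one row per (day, original column, value)
df_long.head(3)

         day variable value
0 2012-01-01        a     A
1 2012-01-03        a     B
2 2012-01-05        a     C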

Edit


This is a more concise way that slipped my mind:

df3 = df2.reset_index().melt('day')
df3 = pd.crosstab(df3['day'], df3['value'])
df3.columns.name = None
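
Note that crosstab (like the pivot above) also drops the NaN values. If you want the NaN column from the desired output, one option is to fill the missing cells with a placeholder label first (a sketch; the literal 'NaN' label is my own choice here):

df3 = df2.reset_index().melt('day')
df3['value'] = df3['value'].fillna('NaN')   # count missing cells under a literal 'NaN' label
df3 = pd.crosstab(df3['day'], df3['value'])
df3.columns.name = None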
Ben Pap
  • Thank you so much for your quick response! Your answer works well; melt + pivot_table looks very powerful. – D. Wei Jan 20 '20 at 22:54
  • Thanks, I forgot that when you're just dealing with counts you can use crosstab; see my edit for even simpler-looking code. It does essentially the same thing. – Ben Pap Jan 20 '20 at 22:57
  • I also tried stack('day') + crosstab; it works for this problem, but doesn't seem as elegant as your melt('day') + crosstab. – D. Wei Jan 20 '20 at 23:07

Let's try using pd.get_dummies and sum with the level parameter:

df2 = pd.DataFrame({'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3), date(2012, 1, 5)]),
                    'a': pd.Series(['A', 'B', 'C']),
                    'b': pd.Series(['C', 'E', 'C']),
                    'c': pd.Series(['E', 'F', 'A']),
                    'd': pd.Series(['B', np.nan, 'E'])})
df2 = df2.set_index('day')

df2 = df2.fillna('NaN')   # count missing values under a literal 'NaN' label
df_out = pd.get_dummies(df2, prefix='', prefix_sep='').T.sum(level=0).T
print(df_out)

Output:

            A  B  C  E  F  NaN
day                           
2012-01-01  1  1  1  1  0    0
2012-01-03  0  1  0  1  1    1
2012-01-05  1  0  2  1  0    0

Or, instead of .T.sum(level=0).T, you can groupby over the columns:

df_out = pd.get_dummies(df2, prefix='', prefix_sep='')
df_out = df_out.groupby(df_out.columns, axis=1).sum()
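
Note that recent pandas versions deprecate both sum(level=...) and groupby(..., axis=1). An equivalent sketch that avoids both is to transpose, group on the index, and transpose back:

df_out = pd.get_dummies(df2, prefix='', prefix_sep='')
df_out = df_out.T.groupby(level=0).sum().T   # same result, no level= or axis=1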
Scott Boston

An easy way to do it would be to use the get_dummies function provided by pandas, such as:

import pandas as pd
pd.get_dummies(df2, dummy_na=True)

Output:

            a_A  a_B  a_C  a_nan  b_C  b_E  b_nan  c_A  c_E  c_F  c_nan  d_B  d_E  d_nan
day
2012-01-01    1    0    0      0    1    0      0    0    1    0      0    1    0      0
2012-01-03    0    1    0      0    0    1      0    0    0    1      0    0    0      1
2012-01-05    0    0    1      0    1    0      0    1    0    0      0    0    1      0

You could then rename the columns and combine them.
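
A rough sketch of that last step (assuming the original column names contain no underscores, so the prefix splits off cleanly):

dummies = pd.get_dummies(df2, dummy_na=True)
# Strip the 'a_', 'b_', ... prefixes so columns sharing a value label can be combined
dummies.columns = [c.split('_', 1)[1] for c in dummies.columns]
result = dummies.T.groupby(level=0).sum().T   # sum the same-labeled columns per day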

Rahul P