1

I have a csv file with dates as columns headers and binary a matrix of 1, 0 or np.nan.

I'd like to take the mean of each index, grouped by month. I am running into a problem because my columns are not a datetimeindex, which I try to convert to with pd.to_datetime() with no luck.

binary.csv:

2016-01-01 00:00:00,2016-01-02 00:00:00,2016-02-01 00:00:00,2016-02-02 00:00:00
1,,0,1
0,1,,1

My Code:

import pandas as pd
import numpy as np

df = pd.read_csv('binary.csv')
df.columns = pd.to_datetime(df.columns, format='%Y-%m-%d %H:%M:%S')
df = df.groupby(pd.TimeGrouper(freq='M'), axis=0)
print df

Error:

TypeError: axis must be a DatetimeIndex, but got an instance of 'Int64Index'

Desired Output:

   2016-01-01 00:00:00  2016-02-01 00:00:00
0                  1.0                  0.5
1                  0.5                  1.0

Updated question:

Based on best answer:

If I wanted to a single value for each month, is there a more efficient way to do that than this?

pd.DataFrame(data=df.resample('MS', axis=1).mean().mean()).transpose()
user2242044
  • 8,803
  • 25
  • 97
  • 164

1 Answers1

2

By default, pd.TimeGrouper works on the index (axis=0) so you need to tell it that it should group the columns instead:

df.groupby(pd.TimeGrouper(freq='MS', axis=1), axis=1).mean()
Out: 
   2016-01-01  2016-02-01
0         1.0         0.5
1         0.5         1.0

You can directly use resample, too:

df.resample('MS', axis=1).mean()
Out: 
   2016-01-01  2016-02-01
0         1.0         0.5
1         0.5         1.0
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • Thanks I confused the axis labels! What is the difference between `freq='MS'` vs `freq='M'`? I can't find it in the documentation? Your second method seems make the dates into index values (averaging both rows together. Does it need a `groupby` also? – user2242044 Dec 06 '16 at 17:19
  • `MS` is for the start of the month (see http://stackoverflow.com/a/17001474/2285236). I did it so it matches your output. With M it would give 2016-01-31 and 2016-02-29. Can you check your pandas version with `pd.__version__`? That display issue seems like a bug. – ayhan Dec 06 '16 at 17:25
  • Thanks for clarifying! I am using version 0.17.0. Your first method does solve my problem, but I am curious about the second one and why it's different for me. – user2242044 Dec 06 '16 at 17:31
  • Yes the second one should be the same as the first one, they mainly do the same thing (no need to do groupby, resample covers that). Yours is probably a version issue. If possible, I suggest you upgrade to the latest version (0.19.1) – ayhan Dec 06 '16 at 17:34
  • I upgraded versions and now the outputs match! I realize I actually wanted to then average all the values for each month so that each month has a single average value. I updated my question with a solution that works, but seems like there is a better way to do it. Any suggestions? – user2242044 Dec 06 '16 at 18:21
  • Taking the mean first over rows and then columns would give a different result than taking it over columns then rows. Was that intentional? Either way, if you want to aggregate on both axis you'll need two sums or two means. Maybe a cleaner version would be `df.sum().resample('MS').sum() / 4` but 4 is hardcoded here. – ayhan Dec 06 '16 at 19:21