0

I'm pretty new to python programming. I read a csv file to a dataframe with median house price of each month as columns. Now I want to create columns to get the mean value of each quarter. e.g. create column housing['2000q1'] as mean of 2000-01, 2000-02, and 2000-03, column housing['2000q2'] as mean of 2000-04,2000-05, 2000-06]... raw dataframe named 'Housing'

I tried to use nested for loops as below, but always come with errors.

for i in range (2000,2017):
      for j in range (1,5):
            Housing[i 'q' j] = Housing[[i'-'j*3-2, i'-'j*3-1, i'_'j*3]].mean(axis=1)

Thank you!

Vincent
  • 59
  • 6
  • Thanks for the formatting!@Woody Pride – Vincent Mar 26 '17 at 17:40
  • I would suggest transposing (so cities are columns and rows are dates) and then do something like: `df.groupby(pd.TimeGrouper(freq='3M')).sum()`. Note: make sure your date are of type datetime. – JohnE Mar 26 '17 at 18:30
  • Also, much better if you can include sample data as plain text that can be copied and pasted rather than an image. It makes it easier for others to read in your data and test answers. – JohnE Mar 26 '17 at 18:32

2 Answers2

0

Usually, we work with data where the rows are time, so it's good practice to do the same and transpose your data by starting with df = Housing.set_index('CountyName').T (also, variable names should usually start with a small letter, but this isn't important here).

Since your data is already in such a nice format, there is a pragmatic (in the sense that you need not know too much about datetime objects and methods) solution, starting with df = Housing.set_index('CountyName').T:

df.reset_index(inplace = True) # This moves the dates to a column named 'index'
df.rename(columns = {'index':'quarter'}, inplace = True) # Rename this column into something more meaningful
# Rename the months into the appropriate quarters
df.quarter.str.replace('-01|-02|-03', 'q1', inplace = True)
df.quarter.str.replace('-04|-05|-06', 'q2', inplace = True)
df.quarter.str.replace('-07|-08|-09', 'q3', inplace = True)
df.quarter.str.replace('-10|-11|-12', 'q4', inplace = True)
df.drop('SizeRank', inplace = True) # To avoid including this in the calculation of means
c = df.notnull().sum(axis = 1) # Count the number of non-empty entries
df['total'] = df.sum(axis = 1) # The totals on each month
df['c'] = c # only ssign c after computing the total, so it doesn't intefere with the total column
g = df.groupby('quarter')[['total','c']].sum()
g['q_mean'] = g['total']/g['c']
g

g['q_mean'] or g[['q_mean']] should give you the required answer.

Note that we needed to compute the mean manually because you had missing data; otherwise, df.groupby('quarter').mean().mean() would have immediately given you the answer you needed.

A remark: the technically 'correct' way would be to convert your dates into a datetime-like object (which you can do with the pd.to_datetime() method), then run a groupby with a pd.TimeGrouper() argument; this would certainly be worth learning more about if you are going to work with time-indexed data a lot.

Ken Wei
  • 3,020
  • 1
  • 10
  • 30
0

You can achieve this using pandas resampling function to compute quarterly averages in a very simple way.

pandas resampling: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html

offset names summary: pandas resample documentation

In order to use this function, you need to have only time as columns, so you should temporarily set CountryName and SizeRank as indexes.

Code:

QuarterlyAverage = Housing.set_index(['CountryName', 'SizeRank'], append = True)\
                          .resample('Q', axis = 1).mean()\
                          .reset_index(['CountryName', 'SizeRank'], drop = False)

Thanks to @jezrael for suggesting axis = 1 in resampling

Community
  • 1
  • 1
FLab
  • 7,136
  • 5
  • 36
  • 69
  • transpose is not necesary, only need `axis=1` - `QuarterlyAverage = Housing.resample('Q', axis=1).mean()` – jezrael Mar 27 '17 at 08:15
  • Thanks for your comment, I have edited my answer. Hard to say without knowing more about data/application, but I think the base case is to have time and index, don't you agree? – FLab Mar 27 '17 at 08:31
  • 1
    If check [image](https://i.stack.imgur.com/8rnEX.jpg) in question, it seems first and second column is necessary set to index first, then resample and last reset index. – jezrael Mar 27 '17 at 08:34