1

I'm using this dataframe (df1) :

df1 =  pd.read_csv('City_Zhvi_AllHomes.csv',header=None)

Its content contains columns written in the format Year-Months(1-12). Now I have to create a new dataframe that contains data from Df1 divided in 4 quarters. Here's what I have done:

Month = ['00','01','02','03','04','05','06','07','08','09','10','11','12']

Year=['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16']


for i in range(0,15):

      for j in range(0,12):

          newdf = pd.DataFrame[df1.loc[2,'20'+Year[i]+'-'+Month[j+1]] + df1.loc[2,'20'+Year[i]+'-'+Month[j+2]]+ df1.loc[2,'20'+Year[i]+'-'+Month[j+3]],columns = ['Q1','Q2','Q3','Q4'] ] 
          j = j + 3


print newdf . 

I have done this specifically for second row however I need to perform this for entire dataset

Now I get this error:

'type' object has no attribute 'getitem'

Here's the sample data :

0             CountyName  SizeRank  2000-01  2000-02  2000-03  2000-04   
1                 Queens         1      NaN      NaN      NaN      NaN   
2            Los Angeles         2   204400   207000   209800   212300   
3                   Cook         3   136800   138300   140100   141900   
4           Philadelphia         4    52700    53100    53200    53400   

newdf should contain :

Q1                      
(204400+207000+209800)

Similarly Q2,Q3,Q4

How do I proceed?

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
harshit
  • 333
  • 1
  • 2
  • 13
  • http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html - there is built-in functionality for this. – John Zwinck Jan 24 '17 at 08:08
  • Can you add to your question sample data - 5-6 rows and desired output? Also maybe help [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jezrael Jan 24 '17 at 08:11

1 Answers1

1

I think you can use resample by columns axis=1 with aggregate sum, but first is necessary set_index of columns CountyName and SizeRank and then convert column names first to_datetime and then to_period:

#remove header=None for first row in csv as columns
df =  pd.read_csv('City_Zhvi_AllHomes.csv')

df = df.set_index(['CountyName','SizeRank'])
#with real data set index of all not dates columns
#df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])
df.columns = pd.to_datetime(df.columns).to_period('M')
print (df)
                        2000-01   2000-02   2000-03   2000-04
CountyName   SizeRank                                        
Queens       1              NaN       NaN       NaN       NaN
Los Angeles  2         204400.0  207000.0  209800.0  212300.0
Cook         3         136800.0  138300.0  140100.0  141900.0
Philadelphia 4          52700.0   53100.0   53200.0   53400.0

print (df.columns)
PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04'], dtype='period[M]', freq='M')

df = df.resample('Q', axis=1).sum()
print (df)
                         2000Q1    2000Q2
CountyName   SizeRank                    
Queens       1              NaN       NaN
Los Angeles  2         621200.0  212300.0
Cook         3         415200.0  141900.0
Philadelphia 4         159000.0   53400.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Getting this error : 'Index' object has no attribute 'to_period' – harshit Jan 24 '17 at 08:35
  • This is the output or columns :([u'RegionID', u'RegionName', u'State', u'Metro', u'CountyName', u'SizeRank', u'2000-01', u'2000-02', u'2000-03', u'2000-04', u'2000-05', u'2000-06', u'2000-07', u'2000-08', u'2000-09', u'2000-10', u'2000-11', u'2000-12', u'2001-01', u'2001-02', u'2001-03', u'2001-04', u'2001-05', u'2001-06', u'2001-07', u'2001-08', u'2001-09', u'2001-10', u'2001-11', u'2001-12', u'2002-01', u'2002-02', u'2002-03', u'2002-04', u'2002-05', u'2002-06', u'2002-07', u'2002-08', u'2002-09', u'2002-10', u'2002-11', u'2002-12', ...], dtype='object') – harshit Jan 24 '17 at 09:12
  • Ok, then need `set_index` of all columns which are not dates like `df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])` – jezrael Jan 24 '17 at 09:13
  • so instead `df = df.set_index(['CountyName','SizeRank'])` need `df = df.set_index(['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank'])` – jezrael Jan 24 '17 at 09:16
  • Yeah . Got it . Thanks – harshit Jan 24 '17 at 09:19
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/133869/discussion-between-rahul-sharma-and-jezrael). – harshit Jan 24 '17 at 09:19