2

I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.

Here is the original dataset and a sample screenshot is provided below.

enter image description here

The task is to convert the data from monthly values to quarterly values i.e. I need to sort of aggregate 2000-01, 2000-02, 2000-03 data to 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of these three values. Likewise 2000-04, 2000-05, 2000-06 would become 2000-Q2 and the new value should be the mean of 2000-04, 2000-05, 2000-06

Here is how I solved the problem.

First I defined a function quarter_rows() which takes a row of data (as a series), loops through every third element using column index, replaces some values (in-place) with a mean computed as explained above and returns the row

import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')

def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row

Now I do some subsetting and cleanup of the data to leave only what I need to work with

p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).ix[:, '2000-01' : ]

I then used apply to apply the function to all rows.

housing3 = housing3.apply(quarter_rows, axis=1)

I get the expected result. A sample is shown below

enter image description here

But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.

I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function. I've tried python lambdas but every way I tried threw an exception. I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.

Thanks

Tagc
  • 8,736
  • 7
  • 61
  • 114
chidimo
  • 2,684
  • 3
  • 32
  • 47

1 Answers1

1

I think you can use instead apply use resample by quarters and aggregate mean, but first convert column names to month periods by to_period:

housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()

Testing:

housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
#for testing slect only 10 first rows and columns from jan 2000 to jun 2000
housing3 = housing.set_index(["State","RegionName"]).ix[:10, '2000-01' : '2000-06']
print (housing3)
                     2000-01   2000-02   2000-03   2000-04   2000-05   2000-06
State RegionName                                                              
NY    New York           NaN       NaN       NaN       NaN       NaN       NaN
CA    Los Angeles   204400.0  207000.0  209800.0  212300.0  214500.0  216600.0
IL    Chicago       136800.0  138300.0  140100.0  141900.0  143700.0  145300.0
PA    Philadelphia   52700.0   53100.0   53200.0   53400.0   53700.0   53800.0
AZ    Phoenix       111000.0  111700.0  112800.0  113700.0  114300.0  115100.0
NV    Las Vegas     131700.0  132600.0  133500.0  134100.0  134400.0  134600.0
CA    San Diego     219200.0  222900.0  226600.0  230200.0  234400.0  238500.0
TX    Dallas         85100.0   84500.0   83800.0   83600.0   83800.0   84200.0
CA    San Jose      364100.0  374000.0  384700.0  395700.0  407100.0  416900.0
FL    Jacksonville   88000.0   88800.0   89000.0   88900.0   89600.0   90600.0

housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
                           2000Q1         2000Q2
State RegionName                                
NY    New York                NaN            NaN
CA    Los Angeles   207066.666667  214466.666667
IL    Chicago       138400.000000  143633.333333
PA    Philadelphia   53000.000000   53633.333333
AZ    Phoenix       111833.333333  114366.666667
NV    Las Vegas     132600.000000  134366.666667
CA    San Diego     222900.000000  234366.666667
TX    Dallas         84466.666667   83866.666667
CA    San Jose      374266.666667  406566.666667
FL    Jacksonville   88600.000000   89700.000000
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I would like to add that the answer to my question led to another problem: how to convert the new columns in PeriodIndex format to string. This post will help http://stackoverflow.com/questions/34800343/python-pandas-convert-type-from-pandas-period-to-string – chidimo Jan 15 '17 at 18:04
  • Ok, you can do it. And if helpful, you can upvote. My or another answer. Both works nice. – jezrael Jan 15 '17 at 18:08