I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem but my answer takes time to compute.
Here is the original dataset and a sample screenshot is provided below.
The task is to convert the data from monthly values to quarterly values i.e. I need to sort of aggregate 2000-01, 2000-02, 2000-03 data to 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of these three values. Likewise 2000-04, 2000-05, 2000-06 would become 2000-Q2 and the new value should be the mean of 2000-04, 2000-05, 2000-06
Here is how I solved the problem.
First I defined a function quarter_rows()
which takes a row of data (as a series), loops through every third element using column index, replaces some values (in-place) with a mean computed as explained above and returns the row
import pandas as pd
import numpy as np
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
def quarter_rows(row):
for i in range(0, len(row), 3):
row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).ix[:, '2000-01' : ]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result. A sample is shown below
But the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows()
function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks