0

I have a dataframe and a series of rates. My calculation is very simple:

new_row(n) = new_row(n-1)*rate + old_row(n)

I have 20 Columns in my dataframe. rate is a series of 20 (1 for each column). I have written a code using loops which take nearly 9 seconds to run. I believe that, it is not the ideal way of doing this exercise. I would like to find a Pythonic way of doing it.

data = pd.read_csv('data.csv')
ret_rate = pd.read_csv('Retention_Rate.csv')

ret_dat = data.copy()
for i in range(4, ret_dat.shape[1]):
    for j in range(1, ret_dat.shape[0]):
        if (ret_dat['MARKET_ID'][j] == ret_dat['MARKET_ID'][j-1]):
        ret_dat.iloc[j, i] = ret_dat.iloc[j, i] + ret_rate.iloc[i-4,0]*ret_dat.iloc[j-1, i]

ret_dat.to_csv('adstock_data_v3.csv')

I have put the data in a Google sheet.

halfer
  • 19,824
  • 17
  • 99
  • 186
Nil
  • 42
  • 5

3 Answers3

0

I'm not sure if this is what you were trying to achieve - but this is simpler and a direct translation of the formula u have given

data = pd.read_csv('data.csv')
ret_rate = pd.read_csv('Retention_Rate.csv',usecols=['rate'])
# since you require only the 5th column onwards
ret_data = data.ix[:,4:]

# you can apply broad operations over rows instead of individual cells now
for i in range(1,len(ret_data)):
    ret_data.iloc[i] = ret_data.iloc[i-1].multiply(ret_rate.rate.iloc[i-1]) + ret_dat.iloc[i]

ret_data.to_csv('your_filename.csv',sep=',')
OPhamster
  • 11
  • 1
  • 5
  • may be I am making a stupid mistake... but this only gives me first line of output. rest is all NaN. – Nil Feb 20 '17 at 17:20
0

Finally found a solution. At this point, the most optimal solution that I could find. Leveraging Signal processing from scipy.signal.lfilter. Here goes the solution:

for media_var in self.config.media_vars: adstocked_vals = lfilter([1], [1, -ret_rate[media_var]], data[media_var])

Thanks to all who gave different answers. It did help me in some way or other. Thanks to a solution by Warren Weckesser on a different thread here

Community
  • 1
  • 1
Nil
  • 42
  • 5
-1
  1. Just use the slice of data which is of interest

  2. The Series ret_rate is for demonstrating purposes a range with the length equal to the number of columns of data

  3. You multiply data with ret_rate along the columns (axis=1) and shift this DataFrame by one and add the initial DataFrame

in code:

data = pd.read_csv("data_so.csv").iloc[:,4:]
ret_rate = pd.Series(range(df.shape[1]), index=df.columns)
data.multiply(ret_rate, 1).shift() + data 

So all the calculation ist just one line of pandas.

For multiply() it is important that the index of ret_rate is equal to the column names of data.

elcombato
  • 473
  • 1
  • 4
  • 16
  • ret_rate has values in csv. facing difficulty in converting that into series with the index being the data.columns. any help with that please... `ret_rate_s = pd.Series(ret_rate.ix[:,0], index=data.columns)` gives me all NaN – Nil Feb 20 '17 at 17:37
  • To read the csv as a Series you should use the pandas function [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Either the csv has only one column or you define the column to use with the parameter `usecols`. In either case if you set `squeeze=True` the `read_csv` function returns a Series instead of a DataFrame. And after reading the csv you can set `ret_rate.index = data.columns` – elcombato Feb 21 '17 at 05:44
  • It worked! @elcombato. I could create a series but then, only to realize that df.multiply cannot handle float type. Only Intergers allowed! bummer! – Nil Feb 22 '17 at 10:48
  • @Nil `df.multiply()` shouldn't have any problem with `float`! – elcombato Feb 22 '17 at 10:56
  • Correct! I found out that I had some text fields. my bad! – Nil Feb 22 '17 at 17:07
  • Hi @Nil if this or any answer has solved your question please consider [accepting it](http://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – elcombato Feb 24 '17 at 15:42
  • Finally figured out an answer and posted below. The solution you gave was almost working except for the fact that it was taking the older value of the previous row. I needed to take the latest value of the previous row. Thanks for your help anyways. Got to learn about the `shift()` function. – Nil Feb 25 '17 at 14:09