1

I'm doing the following operation on a sorted dataset 'df_pre_decay' containing time-series dataset for multiple IDs and I want to decay my 'tactic' variables for each ID at different rates (coming from tactic_decay_dict).

The created variable for decayed tactic variable 'xyz' will have same value as the tactic variable in 1st mnth while for all the other mnth , it will be an addition of ((value of decayed tactic variable 'xyz' in the previous mnth) multiplied by rate) and ((value of tactic variable 'tactic' in the current mnth) multiplied by (1-rate))

time_col = 'mnth'
tactic =['overall_details','speaker_total','overall_samples_eu','copay_redemption_count','voucher_redemption_count','dtc']
tactic_decay_dict = dict.fromkeys(tactic,(60,70))

uniq = len(df_pre_decay[time_col].unique())

## Loops for variables and decay rate
for a in tactic_decay_dict:
    for b in tactic_decay_dict[a]:
        xyz = a+'_s'+str(b)
## Loops for iterating over each row in the dataset
        for i in range(len(df_pre_decay)):
            df_pre_decay[xyz] = np.where((i%uniq)!=0,
                                         (df_pre_decay[xyz].iloc[i-1])*b/100+
                                         (df_pre_decay[a].iloc[i])*(100-b)/100,
                                         df_pre_decay[a].iloc[i])

I want to optimize this section of code as it is taking more than 30 mins for 5 million+ rows in my dataset.

Edit: Please find the sample of dataset

ID  mnth    overall_details speaker_total   overall_samples_eu  copay_redemption_count  voucher_redemption_count    dtc
1   201701  3   1   10  9   3   6
1   201702  6   1   0   7   7   10
1   201703  10  8   7   8   9   10
1   201704  3   9   3   0   1   1
1   201705  9   0   8   9   6   4
1   201706  8   3   2   10  8   9
1   201707  3   10  3   0   5   6
1   201708  2   10  3   9   6   2
1   201709  1   3   7   10  8   0
1   201710  3   8   2   8   0   10
1   201711  6   7   4   8   5   6
1   201712  3   8   2   9   4   10
2   201701  7   4   7   4   10  2
2   201702  10  0   2   2   10  5
2   201703  10  6   4   10  5   3
2   201704  4   3   6   4   0   8
2   201705  7   8   9   10  6   10
2   201706  8   0   2   7   1   8
2   201707  10  2   8   1   9   4
2   201708  10  6   7   0   3   5
2   201709  10  10  3   8   9   0
2   201710  2   0   3   5   5   8
2   201711  1   8   0   7   3   4
2   201712  8   5   1   0   7   9
3   201701  2   2   7   7   1   2
3   201702  2   8   10  9   6   9
3   201703  10  5   8   5   9   4
3   201704  6   1   2   4   6   2
3   201705  6   9   4   4   3   0
3   201706  5   1   6   4   1   7
3   201707  0   7   6   9   5   6
3   201708  10  3   2   0   4   5
3   201709  5   8   6   4   10  4
3   201710  8   3   10  6   7   0
3   201711  7   5   6   3   1   10
3   201712  3   9   8   4   10  0

1 Answers1

0

I don't think your code will work as intended, because you effectively set the entire column df_pre_decay[xyz] to a single value in each round of the loop. You need either to loop through each row of the dataframe (for i in range(len(df_pre_decay))), or treat the columns as vectors (as np.where and other numpy functions do), but you are mixing up both. A vectorized approach will usually be much faster.

For a non-vectorized version, set column xyz to be the same as column a and then loop through the rows, setting the accumulating values where needed.

for a in tactic_decay_dict:
    for b in tactic_decay_dict[a]:
        xyz = a+'_s'+str(b)
## Loops for iterating over each row in the dataset
        df_pre_decay[xyz] = df_pre_decay[a]
        for i in range(len(df_pre_decay)):
            if i % uniq != 0:
                df_pre_decay[xyz].iloc[i] = (df_pre_decay[xyz].iloc[i-1] * b/100
                     + df_pre_decay[a].iloc[i] * (100 - b)/100)

Or an alternative version - not sure which will be faster:

for a in tactic_decay_dict:
    for b in tactic_decay_dict[a]:
        xyz = a+'_s'+str(b)
        column = []
        for i, x in enumerate(df_pre_decay[a]):
            if i % uniq == 0:
                current = x
            else:
                current = x * b/100 + current * (100-b)/100
            column.append(current)
        df[xyz] = column

To vectorize, you can break the columns into chunks and applies the cumulative decay function to each, using numpy.ufunc.accumulate.

for a in tactic_decay_dict:
    for b in tactic_decay_dict[a]:
        xyz = a+'_s'+str(b)
        decay_func = np.frompyfunc(lambda u, v: u * b / 100.0 + v * (100-b) / 100.0, 2, 1)
        decayed = np.array([])
        for top in range(0, len(df_pre_decay), uniq):
            chunk = df_pre_decay[a][top:top+uniq]
            decayed = np.concatenate((decayed, 
                                  decay_func.accumulate(chunk, dtype=np.object).astype(np.float)))
        df_pre_decay[xyz] = decayed

An alternative way would be to insert blank rows with null values in between the different IDs. You could then apply a single accumulating function to the whole columns:

# insert blank rows in the data
df.index = df.index + df.index // uniq
df.reindex(index=range(len(df) + len(df) // uniq))   

def get_decay_func(b):
    def inner(u, v):
        if pd.isnull(u) or pd.isnull(v):
            return v
        else:
            return u * b/100.0 + v * (100-b)/100.0
    return inner

for a in tactic_decay_dict:
   for b in tactic_decay_dict[a]:
        decay = get_decay_func(b).accumulate
        xyz = a+'_s'+str(b)
        df_pre_decay[xyz] = decay(df_pre_decay[a], dtype=np.object).astype(df.float)
Stuart
  • 9,597
  • 1
  • 21
  • 30