Python help optimize this function

Question

data = 
        Symbol   Value  Day
0         AACG  1.8708    1
1         AACG  1.8500    2
2         AACG  1.8869    3
3         AACG  1.8200    4
4         AACG  1.8578    5
...        ...     ...  ...
3407024   ZYXI   5.25    1
3407025   ZYXI   4.96    2
3407026   ZYXI   4.99    3
3407027   ZYXI   4.99    4
3407028   ZYXI   4.95    5
...        ...    ...  ...
3407250   ZYXI  8.1500  227
3407251   ZYXI  8.2600  228
3407252   ZYXI  8.3900  229
3407253   ZYXI  8.1200  230
3407254   ZYXI  8.0700  231

import pandas as pd
import numpy as np

for index, row in data.iterrows():
    for i in range(1, 91):
        cstr = 'day-' + str(i)
        val = 'NaN'
        try:
            val = float(data[np.logical_and(data['Symbol'] == row['Symbol'],
                            data['Day'] == row['Day'] - i)].Value)
        except:
            val = 'NaN'
        data.loc[index,cstr] = val

The function loops through each row in the data frame

for each row in the data frame, it loops 90 times (i)

for each loop, it adds a column with value

value is the value in data frame with the same symbol as row but day as day from row minus i

output =
  Symbol   Value  Day   day-1   day-2   day-3   day-4... day-89 day-90
0   AACG  1.8708    1     NaN     NaN     NaN     NaN
1   AACG  1.8500    2  1.8708     NaN     NaN     NaN
2   AACG  1.8869    3  1.8500  1.8708     NaN     NaN
3   AACG  1.8200    4  1.8869  1.8500  1.8708     NaN
4   AACG  1.8578    5  1.8200  1.8869  1.8500  1.8708
5   AACG  1.8709    6  1.8578  1.8200  1.8869  1.8500
6   AACG  1.8700    7  1.8709  1.8578  1.8200  1.8869
7   AACG  1.8800    8  1.8700  1.8709  1.8578  1.8200
8   AACG  1.8000    9  1.8800  1.8700  1.8709  1.8578
9   AACG  1.7900   10  1.8000  1.8800  1.8700  1.8709

Please repeat [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). You seem to be looking for a code review for a problem you haven't characterized -- either of which suggests that Stack Overflow is not the right place for your issue. — Prune, Jan 24 '21 at 07:25
What is your expected output? What are you trying to do. Can you please explain in plain english — Joe Ferndz, Jan 24 '21 at 07:34
@JoeFerndz I am basically add the previous 90 days values as columns to each row, so each row is a set of the previous 90 days values and could be charted — GFG, Jan 24 '21 at 07:35
you want 90 columns? what will the value be in each of the 90 columns? Can you take 2 rows and give us the example output — Joe Ferndz, Jan 24 '21 at 07:37

Pygirl · Answer 1 · 2021-01-24T08:28:03.760

1

try using shift and pd.concat

N = 5
df_new = pd.DataFrame()
for i,grp in df.groupby('Symbol'):
    l = pd.concat([grp['Value'].shift(i).rename(f'Day_{i}') for i in range(1,N)], axis=1)
    final_df = pd.concat([grp, l], axis=1)
    df_new = df_new.append(final_df)

OR

def f(x):
    x['Day-0'] = x['Value']
    for i in range(1,N+1):
        x[f'Day-{i}'] = x[f'Day-{i-1}'].shift()
    x.drop('Day-0', inplace=True ,axis=1)
    return x

final_df = df.groupby('Symbol').apply(f)

**final_df:"

edited Jan 24 '21 at 08:28

answered Jan 24 '21 at 08:02

Pygirl

12,969
5
30
43

when the `Symbol` changes from `AACG` to `ZYXI`, it will carry forward the last value of `AACG` which is incorrect. We have to solve for that as well. Otherwise it will NOT meet OP requirement. Thats the same challenge I have – Joe Ferndz Jan 24 '21 at 08:04
@JoeFerndz: Then groupby Symbol and for each symbol computing shift and concatenting them vertically and in the final step all the group will be merged horizontally. That will do. I will update my answer. Thanks for pointing this out. I din't see that – Pygirl Jan 24 '21 at 08:06
You just need to solve for Day_1 using the groupby(Symbol), then use Day_1 as the base to shift(i) where i ranges from 2 thru 90 – Joe Ferndz Jan 24 '21 at 08:17
I used the second function here and it worked – GFG Jan 24 '21 at 08:34
Could you please tell me how this works, what you were thinking, why this is much faster than what I was doing? – GFG Jan 24 '21 at 08:35
1

Well using apply is not recommended as it is slow but in you case I couldn't think of any approach. SO what happen is like you are making a group by symbols then for each symbol you are calculating a shift from 1 to 5(included). That function f(x) does that shifting part. – Pygirl Jan 24 '21 at 08:40
1

@GFG: Don't use `Itterrows`.Using `apply` is much better than it. https://stackoverflow.com/a/55557758/6660373 – Pygirl Jan 24 '21 at 08:41
1

upvoted your answer. apply(f) is also a neat trick. – Joe Ferndz Jan 24 '21 at 09:04

Joe Ferndz · Accepted Answer · 2021-01-24T08:52:59.120

You can do the following:

Step 1: Create a dictionary with 90 columns and assign np.nan to it. Each column will have Day_ + str(i) where i ranges from 1 thru 90.
Step 2: Create a dataframe with those 90 columns.
Step 3: Concatenate the dataframe to your original dataframe. Now you have the 90 additional columns with np.NaN as values.
Step 4: Now do shift(1) of Value after groupby(Symbol)
Step 5: Now iterate from 2 thru 90 and do shift(1) of Day_1. That will give you all the required values.

The code to do this is:

c = ['Symbol','Value','Day']
d = [['AACG',1.8708,1],
     ['AACG',1.8500,2],
     ['AACG',1.8869,3],
     ['AACG',1.8200,4],
     ['AACG',1.8578,5],
     ['ZYXI',5.25,1],
     ['ZYXI',4.96,2],
     ['ZYXI',4.99,3],
     ['ZYXI',4.99,4],
     ['ZYXI',4.95,5]]

import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
cols = {'Day_'+str(i):np.NaN for i in range(1,91)}

df = pd.concat([df,pd.DataFrame(cols,index=df.index)], axis=1)

for i in range (1,91):
    df['Day_'+str(i)] = df.groupby(['Symbol'])['Value'].transform(lambda x:x.shift(i))

print (df)

The output of this will be:

  Symbol   Value  Day   Day_1   Day_2  ...  Day_86  Day_87  Day_88  Day_89  Day_90
0   AACG  1.8708    1     NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN
1   AACG  1.8500    2  1.8708     NaN  ...     NaN     NaN     NaN     NaN     NaN
2   AACG  1.8869    3  1.8500  1.8708  ...     NaN     NaN     NaN     NaN     NaN
3   AACG  1.8200    4  1.8869  1.8500  ...     NaN     NaN     NaN     NaN     NaN
4   AACG  1.8578    5  1.8200  1.8869  ...     NaN     NaN     NaN     NaN     NaN
5   ZYXI  5.2500    1     NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN
6   ZYXI  4.9600    2  5.2500     NaN  ...     NaN     NaN     NaN     NaN     NaN
7   ZYXI  4.9900    3  4.9600  5.2500  ...     NaN     NaN     NaN     NaN     NaN
8   ZYXI  4.9900    4  4.9900  4.9600  ...     NaN     NaN     NaN     NaN     NaN
9   ZYXI  4.9500    5  4.9900  4.9900  ...     NaN     NaN     NaN     NaN     NaN

I will create a dataframe with 90+ rows for AACG and show you the results so you can see that Day_90 will have the right value.

I added a few more records to show you that the 90 days column gets filled.

>>> df.iloc[80:100]
   Symbol   Value  Day   Day_1   Day_2  ...  Day_86  Day_87  Day_88  Day_89  Day_90
80   AACG  1.8659   81  1.8658  1.8657  ...     NaN     NaN     NaN     NaN     NaN
81   AACG  1.8660   82  1.8659  1.8658  ...     NaN     NaN     NaN     NaN     NaN
82   AACG  1.8661   83  1.8660  1.8659  ...     NaN     NaN     NaN     NaN     NaN
83   AACG  1.8662   84  1.8661  1.8660  ...     NaN     NaN     NaN     NaN     NaN
84   AACG  1.8663   85  1.8662  1.8661  ...     NaN     NaN     NaN     NaN     NaN
85   AACG  1.8664   86  1.8663  1.8662  ...     NaN     NaN     NaN     NaN     NaN
86   AACG  1.8665   87  1.8664  1.8663  ...  1.8708     NaN     NaN     NaN     NaN
87   AACG  1.8666   88  1.8665  1.8664  ...  1.8500  1.8708     NaN     NaN     NaN
88   AACG  1.8667   89  1.8666  1.8665  ...  1.8869  1.8500  1.8708     NaN     NaN
89   AACG  1.8668   90  1.8667  1.8666  ...  1.8200  1.8869  1.8500  1.8708     NaN
90   AACG  1.8669   91  1.8668  1.8667  ...  1.8578  1.8200  1.8869  1.8500  1.8708
91   AACG  1.8670   92  1.8669  1.8668  ...  1.8584  1.8578  1.8200  1.8869  1.8500
92   AACG  1.8671   93  1.8670  1.8669  ...  1.8585  1.8584  1.8578  1.8200  1.8869
93   AACG  1.8672   94  1.8671  1.8670  ...  1.8586  1.8585  1.8584  1.8578  1.8200
94   AACG  1.8673   95  1.8672  1.8671  ...  1.8587  1.8586  1.8585  1.8584  1.8578
95   ZYXI  5.2500    1     NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN
96   ZYXI  4.9600    2  5.2500     NaN  ...     NaN     NaN     NaN     NaN     NaN
97   ZYXI  4.9900    3  4.9600  5.2500  ...     NaN     NaN     NaN     NaN     NaN
98   ZYXI  4.9900    4  4.9900  4.9600  ...     NaN     NaN     NaN     NaN     NaN
99   ZYXI  4.9500    5  4.9900  4.9900  ...     NaN     NaN     NaN     NaN     NaN

`ZYXI` will get the value from `AACG` using this : `df['Day_'+str(i)] = df['Day_1'].shift(i-1)` — Pygirl, Jan 24 '21 at 08:30
Took care of it by using the groupby.transform option and using shift(i) inside lambda function. — Joe Ferndz, Jan 24 '21 at 08:39
@JoeFerndz This works now and I think its a bit faster than the other. Could you please tell me how this works, what you were thinking, why this is much faster than what I was doing? — GFG, Jan 24 '21 at 08:49
I am iterating through the data from 1 thru 90 as we need to shift(1) for each column. While we do that, we also should consider the `Symbol`. Shift can happen only if the `Symbol` is the same. So I am grouping by `Symbol` and shifting the value in the column `Value` by 1 for each iteration. — Joe Ferndz, Jan 24 '21 at 08:55
Both @Pygirl and I are doing the same thing. Pygiri is iterating by each group and then shifting for it 90 times. So two loops vs. one loop in mine. Thats the only difference — Joe Ferndz, Jan 24 '21 at 08:58
You are iterating through the entire dataframe. one row at a time. By using groupby, you are able to get the value immediately and you don't need to iterate through each row. As you groupby, we are using transform. That sends back a results as the same length of the dataframe. For each row that it sends back, we are doing the shift(1). That way it automatically shifts. And I am doing it only 90 times in total. Thats why it is faster — Joe Ferndz, Jan 24 '21 at 09:00
In your code, you are computing the value each time. Here we are just shifting the row value by 1 which is quicker than computing each row each time — Joe Ferndz, Jan 24 '21 at 09:01

Python help optimize this function

2 Answers2