
First post: I apologize in advance for sloppy wording (and possibly poor searching, if this question has been answered ad nauseam elsewhere - maybe I don't know the right search terms yet).

I have data in 10-minute chunks and I want to perform calculations on a column ('input') grouped by minute (i.e. 10 separate 60-second blocks - not a rolling 60-second window) and then store all ten results in a single list called output.

The 'seconds' column records the second, from 1 to 600, within the 10-minute period. If no data was entered for a given second, there is no row for that second. So, some minutes have 60 rows of data and some have as few as one or two.

Note: the calculation (my_function) is not basic, so I can't use groupby with np.sum(), np.mean(), etc. - or at least I can't figure out how to use groupby.

I have code that gets the job done but it looks ugly to me so I am sure there is a better way (probably several).

import numpy as np

output = []
seconds_slicer = 0

# ten 60-second blocks: (0, 60], (60, 120], ..., (540, 600]
for i in np.linspace(1, 10, 10):
    seconds_slicer += 60
    minute_slice = df[(df['seconds'] > (seconds_slicer - 60)) &
                      (df['seconds'] <= seconds_slicer)]
    calc = my_function(minute_slice['input'])
    output.append(calc)

If there is a cleaner way to do this, please let me know. Thanks!

Edit: Adding sample data and function details:

   seconds     input
1        1  0.000054
2        2 -0.000012
3        3  0.000000
4        4  0.000000
5        5  0.000045

def realized_volatility(series_log_return):
    # root of the sum of squared log returns over the window
    return np.sqrt(np.sum(series_log_return**2))
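
As a quick sanity check (my own arithmetic, not part of the original post), applying the function to the five sample rows above comes out around 7.13e-05:

import numpy as np
import pandas as pd

sample = pd.Series([0.000054, -0.000012, 0.000000, 0.000000, 0.000045])
realized_volatility(sample)  # sqrt(5.085e-09) ≈ 7.13e-05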
    Seeing what your function does, or some sample input and expected output would help us to understand the nature of your question. On the face of it, it looks a lot like doing a standard resample in pandas, does this link help? [Pandas Resample Apply Custom Function?](https://stackoverflow.com/questions/41300653/pandas-resample-apply-custom-function) – G. Anderson Aug 23 '21 at 22:20
  • I was trying to do this without converting the seconds column to datetime (but maybe that is what I should do). I am about to add a data sample and the details of the function in the original post. I left the function details out originally because I was focused on the slicing and didn't want to clutter with details. – Straken Aug 24 '21 at 00:03
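
For reference, a minimal sketch of the datetime route from the linked question, assuming the seconds column is converted with pd.to_timedelta (the resample closed/label options would need checking against the (0, 60] slices in the question):

import pandas as pd

# Index the frame by elapsed time, then resample into 60-second bins
ts = df.set_index(pd.to_timedelta(df['seconds'], unit='s'))
per_minute = ts['input'].resample('60s').apply(realized_volatility)
output = per_minute.tolist()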

1 Answer


For this answer, we're going to repurpose the approach from "Bin pandas dataframe by every X rows".

We'll create a dataframe with gaps in the 'seconds' column, which is how I understand your data based on the description given:

import numpy as np
import pandas as pd

# 15 rows with some seconds missing (10, 13, 16, and 18 are absent)
secs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 17, 19]
data = [np.random.randint(-25, 54) / 100000 for _ in range(15)]

df = pd.DataFrame(data=zip(secs, data), columns=['seconds', 'input'])

df

    seconds    input
0         1  0.00017
1         2 -0.00020
2         3  0.00033
3         4  0.00052
4         5  0.00040
5         6 -0.00015
6         7  0.00001
7         8 -0.00010
8         9  0.00037
9        11  0.00050
10       12  0.00000
11       14 -0.00009
12       15 -0.00024
13       17  0.00047
14       19 -0.00002

I didn't create 600 rows, but that's okay; we'll say we want to bin every 5 seconds instead of every 60. Because we're just trying to use equal time measures for grouping, we can use floor division to see which bin each second ends up in. (In your case, you'd divide by 60 instead.)
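
To make the binning concrete, here is what the floor division produces for the sample frame above (a quick check, computed from the secs list):

(df['seconds'] // 5).tolist()
# [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]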

# Group by 5-second bin; drop the extra 'seconds' column because we
# don't care about the root sum of squares of the seconds themselves
grouped = df.groupby(df['seconds'] // 5).apply(realized_volatility).drop('seconds', axis=1)

grouped

        input
seconds 
0       0.000441
1       0.000372
2       0.000711
3       0.000505
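
Back at the question's scale, a minimal sketch of the same pattern with 60-second bins: since 'seconds' runs from 1 to 600, subtracting 1 before the floor division keeps second 60 in the first minute, matching the (0, 60] slices in the original loop, and selecting the 'input' column up front makes the drop() call unnecessary.

# One bin per minute: (seconds - 1) // 60 maps 1-60 -> 0, 61-120 -> 1, ...
per_minute = df.groupby((df['seconds'] - 1) // 60)['input'].apply(realized_volatility)
output = per_minute.tolist()  # ten values, one per 60-second block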