
I have a pandas dataframe that looks like this:

FileName    Num     Onsets          Offsets          Durations         
FileName1   3       [19, ..., 1023] [188, ..., 1252] [169, ..., 229] 
FileName2   5       [52, ..., 2104] [472, ..., 2457] [420, ..., 353] 
FileName3   4       [18, ..., 1532] [356, ..., 2018] [338, ..., 486] 

This shows the onset and offset times of events in a time series, along with the duration of each event. Each time series is actually a repetition of a small set of events, and the number of events in that set is in the Num column. For instance, in the first row Num is 3, so if Onsets, Offsets, and Durations each have 12 values, the underlying set of three events repeated 4 times. In other words, the pattern looks like [a, b, c, a, b, c, a, b, c, a, b, c] in each of those columns.

I need to find the average of each position in the underlying set for the Durations column. That means the average of each a duration, each b duration, each c duration, and so on. Then I need these averages to be appended to the dataframe in a new column. That means the length of the arrays in the Averages column will be equal to the value in the Num column.
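To make the averaging concrete, here is a tiny worked example with made-up numbers (assuming the list length is an exact multiple of Num):

```python
import numpy as np

# one row's Durations with Num = 3: the set [a, b, c] repeated twice
durations = [10, 20, 30, 40, 60, 80]
num = 3

# reshape into (repetitions, num) so each column holds one position's values,
# then average down the columns
averages = np.asarray(durations).reshape(-1, num).mean(axis=0)
print(averages)  # [25. 40. 55.]
```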

I assume the thing to do is create a function with a for loop that iterates through each row, indexes each number in Durations, averages every nth number according to the value in the Num column, stores those averages, and then appends them to the original dataframe as a new column.

I'm thinking it might look something like the following, but I'm new to Python and coding, so I'm not sure:

Duration = np.empty(len(Onsets))

def averages(data): 
    
    for ionset,onset in enumerate(onsets):

        Duration[ionset] = #what I described above

How can I achieve this?

user90664

2 Answers


This can be achieved with the apply functionality in pandas. Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

import pandas as pd

df = pd.DataFrame({
    "FileName": ["FileName1"],
    "Num": 3,
    "Onsets": [[10, 11, 12, 34, 53, 22, 56, 24, 63, 24, 35, 1]],
    "Offsets": [[13, 25, 2, 35, 63, 23, 63, 23, 765, 24, 6, 1]],
    "Durations": [[1, 356, 6, 2, 6, 2, 2, 2, 6, 65, 23, 2]]
})

def calculate_average(num, values):
    n_values = len(values) // num  # number of repetitions of the underlying set
    averages = list()
    for i in range(0, num):
        summation = 0
        for j in range(0, n_values):
            summation += values[j*num + i]  # i-th position of the j-th repetition
        averages.append(summation / n_values)
    return averages
df["Onsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Onsets"]), axis=1)
df["Offsets_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Offsets"]), axis=1)
df["Durations_avg"] = df.apply(lambda x: calculate_average(x["Num"], x["Durations"]), axis=1)
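As a cross-check, the same per-position averages can be computed with a NumPy reshape inside apply (a sketch that assumes each list's length is an exact multiple of Num):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Num": [3],
    "Durations": [[1, 356, 6, 2, 6, 2, 2, 2, 6, 65, 23, 2]],
})

# reshape each list into (repetitions, Num) and average down the columns
df["Durations_avg"] = df.apply(
    lambda x: np.asarray(x["Durations"]).reshape(-1, x["Num"]).mean(axis=0).tolist(),
    axis=1,
)
print(df["Durations_avg"].iloc[0])  # [17.5, 96.75, 4.0]
```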
  • Thank you for your answer! It didn't quite work with my df, though. Within each row, every average was exactly the same. – user90664 Sep 09 '20 at 18:27

This is one of the reasons why keeping iterable objects as values inside a dataframe should be avoided: most pandas functions do not work well with this kind of structure.

That being said, you can get your solution through df.explode and df.transform among other tricks.
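Before the full solution, it may help to see what df.explode does on its own: each list element becomes its own row while the other columns are repeated, which is what makes grouping possible. A minimal illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "FileName": ["f1"],
    "Num": [2],
    "Durations": [[10, 20, 30, 40]],
})

out = df.explode("Durations", ignore_index=True)  # 4 rows, one per list element

# cyclic position of each row within its file: 0, 1, 0, 1
out["tag"] = out.groupby("FileName").cumcount() % out["Num"]
```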

import pandas as pd
import numpy as np

# sample data
# please always provide a callable line of code with your data
# you can get it with df.head(10).to_dict('split')
# read more about this in https://stackoverflow.com/q/63163251/6692898
# and https://stackoverflow.com/q/20109391/6692898
np.random.seed(1)
df = pd.DataFrame({
    'FileName': ['FileName1', 'FileName2', 'FileName3'],
    'Num': [3, 5, 4],
    'Onsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Offsets': [np.random.randint(1, 1000, 12) for _ in range(3)],
    'Durations': [np.random.randint(1, 1000, 12) for _ in range(3)]
})
print(df)

    FileName  Num                                             Onsets  \
0  FileName1    3  [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1  FileName2    5  [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2  FileName3    4  [908, 253, 491, 669, 926, 399, 563, 581, 216, ...

                                             Offsets  \
0  [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1  [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2  [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...

                                           Durations
0  [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1  [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2  [934, 314, 596, 167, 649, 289, 419, 779, 280, ...

The code

# work with a temp dataframe
df2 = df[['FileName', 'Num', 'Durations']].explode('Durations')
df2.Durations = df2.Durations.astype(int) # needed only because of how the sample was created
# should not be necessary with your dataframe

df2['tag'] = ( # add cyclic tags to each row, within each FileName
    df2.groupby('FileName').Durations.transform('cumcount') # similar to range(len(group))
    % df2.Num # get the modulo of the row number within the group
)

# get averages and collect into lists
df2 = df2.groupby(['FileName', 'tag']).Durations.mean() # get average
df2.rename('Durations_avgs', inplace=True)

# collect in a list by Filename and merge with original df
df = df.merge(df2.groupby('FileName').agg(list), on='FileName')
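To convince yourself that the cumcount-modulo tagging really picks out the cyclic positions, here is a small self-contained cross-check against a plain NumPy reshape (made-up numbers):

```python
import numpy as np
import pandas as pd

# Num = 3: the set [a, b, c] repeated twice
durations = [337, 622, 884, 298, 467, 16]
num = 3

tidy = pd.DataFrame({"FileName": ["f1"] * 6, "Num": num, "Durations": durations})
tidy["tag"] = tidy.groupby("FileName").cumcount() % tidy["Num"]
by_tag = tidy.groupby("tag")["Durations"].mean().tolist()

# same averages, computed directly
by_reshape = np.asarray(durations).reshape(-1, num).mean(axis=0).tolist()
assert by_tag == by_reshape  # both are [317.5, 544.5, 450.0]
```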

Output

    FileName  Num                                             Onsets  \
0  FileName1    3  [38, 236, 909, 73, 768, 906, 716, 646, 848, 96...
1  FileName2    5  [973, 584, 750, 509, 391, 282, 179, 277, 255, ...
2  FileName3    4  [908, 253, 491, 669, 926, 399, 563, 581, 216, ...

                                             Offsets  \
0  [479, 865, 87, 142, 394, 8, 320, 830, 535, 314...
1  [317, 210, 265, 729, 654, 628, 432, 634, 457, ...
2  [455, 918, 562, 314, 516, 965, 793, 498, 44, 5...

                                           Durations  \
0  [337, 622, 884, 298, 467, 16, 65, 197, 26, 368...
1  [904, 283, 666, 617, 23, 778, 708, 127, 280, 3...
2  [934, 314, 596, 167, 649, 289, 419, 779, 280, ...

                                      Durations_avgs
0                             [267.0, 506.25, 349.5]
1  [679.6666666666666, 382.3333333333333, 396.5, ...
2  [621.0, 419.6666666666667, 589.0, 344.66666666...

Update

Kshitij shows a good idea in defining a function for this (in case you want to do it for several columns). But it is good to avoid apply when the task can be done with pandas native functions.

Here is a function that does that dynamically for any column:

def get_averages(df: pd.DataFrame, column: str) -> pd.DataFrame:
    '''
    Return `df` with an added `{column}_avgs` column holding
    the averages of each `Num` cyclical item for each row
    '''
    # work with a new dataframe
    df2 = (
        df[['FileName', 'Num', column]]
        .explode(column, ignore_index=True)
    )
    
    # needed only because of how the sample was created
    # should not be necessary with your dataframe
    df2[column] = df2[column].astype(int)
    
    df2['tag'] = ( # add cyclic tags to each row, within each FileName
        df2.groupby('FileName')[column]
            .transform('cumcount') # similar to range(len(group))
        % df2.Num # get the modulo of the row number within the group
    )
    
    # get averages and collect into lists
    df2 = df2.groupby(['FileName', 'tag'])[column].mean() # get average
    df2.rename(f'{column}_avgs', inplace=True)
    
    # collect in a list by Filename and merge with original df
    df2 = df2.groupby('FileName').agg(list)
    df = df.merge(df2, on='FileName')
    
    return df


df = get_averages(df, 'Durations')
RichieV
  • This works great, thank you! And it turns out I did need the .astype(int) line for my df, so I'm glad you included that. – user90664 Sep 09 '20 at 18:25