Average Data in all Rows that Meet a Condition for every X Rows in a Pandas Dataframe

Question

This is my first post here, so please let me know if I'm doing anything incorrectly. I'm also pretty new to Pandas, so I could easily be overlooking something obvious here.

I have a dataset for which I need to perform periodic averages. Within each of the periods, I also need separate averages for subsets that have different values in a certain column.

Here's a made up dataframe for illustrative purposes:

import pandas as pd
# subjects 1 and 2 pertain to math and science, respectively
df = pd.DataFrame([[1, 85, 3], [1, 70, 2], [2, 77, 4], [2, 96, 3],
                   [1, 91, 2], [1, 83, 4], [1, 97, 6], [1, 92, 2],
                   [1, 76, 1], [2, 80, 2], [2, 90, 4], [2, 99, 5]],
                   columns=['Subject', 'Score', 'StudyHours'])

I'd like to be able to average all column data within every x rows (I'll use x = 4 in this case) and also take the average for each subject within each set of 4 rows.

The first requirement is fulfilled here, and I know you can get a subset of data based on a condition within a column (for example, df_math = df[df['Subject'] == 1]), but I can't figure out/find a way to combine these functionalities. Both the rolling and groupby methods in the linked post produce their own type of object, which, from what I can tell, can't be separated based on a column condition in the same way a dataframe can.

I have a solution that works, but it isn't very elegant and it doesn't really seem to take much advantage of the efficient functions that pandas has for its dataframes, so I'm hoping someone more experienced with pandas could help me improve my solution:

from math import ceil
total_rows = len(df)
period = 4
num_per = ceil(total_rows / period)
num_sub = len(df['Subject'].unique())

# Create list to store DFs of averaged data
avg_vals = [ [] for _ in range(num_per) ]
avg_vals = [ avg_vals.copy() for _ in range(num_sub + 1) ]

i = 0
while i < total_rows:
    # Define current set of x rows
    section = df[i:i+period]
    # Determine current row in averages list
    row = int(i / period)
    # Add DF of means to array of total averages list
    avg_vals[0][row] = pd.DataFrame(section.mean()).transpose()
    for j in range(1, num_sub+1):
        # If there is any data for current subject...
        if any(section['Subject'] == j):
            # Add DF of current subject means to its averages list
            avg_vals[j][row] = pd.DataFrame(section[section['Subject'] == j].mean()).transpose()
    # increment i to move to next set of x rows
    i = i + period

# Remove any empty list elements
for i in range(1, num_sub + 1):
    avg_vals[i] = [ j for j in avg_vals[i] if type(j) != list ]
# Concatenate DFs of averages in each list and set correct indexing
for i in range(num_sub + 1):
    avg_vals[i] = pd.concat(avg_vals[i]).reset_index(drop=True)

This gives me my desired dataframes in the avg_vals list as shown below:

# Total Averages: avg_vals[0] 
    Subject    Score    StudyHours
0   1.50       82.00    3.0
1   1.00       90.75    3.5
2   1.75       86.25    3.0

# Math Averages: avg_vals[1]
    Subject    Score    StudyHours
0   1.0        77.50    2.5
1   1.0        90.75    3.5
2   1.0        76.00    1.0

# Science Averages: avg_vals[2]
    Subject    Score    StudyHours
0   2.0        86.50    3.50
1   2.0        89.67    3.67

If you could post the output dataframe you are looking for, I believe you will get an answer very soon. — SomeDude, Apr 22 '21 at 17:36
@SomeDude The dataframes I included at the end of my example solution (avg_vals[0], avg_vals[1], and avg_vals[2]) are what I'm looking for. The method I used had all three of them contained in the avg_vals list, but that isn't totally necessary for my goal. I just did it that way for data management purposes. — FredericoUnO, Apr 22 '21 at 18:35
I edited the original post to make the dataframes I'm looking for more clear. — FredericoUnO, Apr 22 '21 at 18:41

Average Data in all Rows that Meet a Condition for every X Rows in a Pandas Dataframe

0 Answers0