This is my first post here, so please let me know if I'm doing anything incorrectly. I'm also pretty new to Pandas, so I could easily be overlooking something obvious here.
I have a dataset for which I need to perform periodic averages. Within each of the periods, I also need separate averages for subsets that have different values in a certain column.
Here's a made up dataframe for illustrative purposes:
import pandas as pd
# subjects 1 and 2 pertain to math and science, respectively
df = pd.DataFrame([[1, 85, 3], [1, 70, 2], [2, 77, 4], [2, 96, 3],
[1, 91, 2], [1, 83, 4], [1, 97, 6], [1, 92, 2],
[1, 76, 1], [2, 80, 2], [2, 90, 4], [2, 99, 5]],
columns=['Subject', 'Score', 'StudyHours'])
I'd like to be able to average all column data within every x rows (I'll use x = 4 in this case) and also take the average for each subject within each set of 4 rows.
The first requirement is fulfilled here, and I know you can get a subset of data based on a condition within a column (for example, df_math = df[df['Subject'] == 1]), but I can't figure out/find a way to combine these functionalities. Both the rolling and groupby methods in the linked post produce their own type of object, which, from what I can tell, can't be separated based on a column condition in the same way a dataframe can.
I have a solution that works, but it isn't very elegant and it doesn't really seem to take much advantage of the efficient functions that pandas has for its dataframes, so I'm hoping someone more experienced with pandas could help me improve my solution:
from math import ceil
total_rows = len(df)
period = 4
num_per = ceil(total_rows / period)
num_sub = len(df['Subject'].unique())
# Create list to store DFs of averaged data
avg_vals = [ [] for _ in range(num_per) ]
avg_vals = [ avg_vals.copy() for _ in range(num_sub + 1) ]
i = 0
while i < total_rows:
# Define current set of x rows
section = df[i:i+period]
# Determine current row in averages list
row = int(i / period)
# Add DF of means to array of total averages list
avg_vals[0][row] = pd.DataFrame(section.mean()).transpose()
for j in range(1, num_sub+1):
# If there is any data for current subject...
if any(section['Subject'] == j):
# Add DF of current subject means to its averages list
avg_vals[j][row] = pd.DataFrame(section[section['Subject'] == j].mean()).transpose()
# increment i to move to next set of x rows
i = i + period
# Remove any empty list elements
for i in range(1, num_sub + 1):
avg_vals[i] = [ j for j in avg_vals[i] if type(j) != list ]
# Concatenate DFs of averages in each list and set correct indexing
for i in range(num_sub + 1):
avg_vals[i] = pd.concat(avg_vals[i]).reset_index(drop=True)
This gives me my desired dataframes in the avg_vals list as shown below:
# Total Averages: avg_vals[0]
Subject Score StudyHours
0 1.50 82.00 3.0
1 1.00 90.75 3.5
2 1.75 86.25 3.0
# Math Averages: avg_vals[1]
Subject Score StudyHours
0 1.0 77.50 2.5
1 1.0 90.75 3.5
2 1.0 76.00 1.0
# Science Averages: avg_vals[2]
Subject Score StudyHours
0 2.0 86.50 3.50
1 2.0 89.67 3.67