0

I'm working with a large data frame, but for simplicity’s sake, let's say I have a data frame with columns labeled: year, stat1, stat2 . . . statn

   year    stat1    stat2 . . . statn

0  1970   #          #      . . .  #

1  1971   #          #      . . .  #

2  1972   #          #      . . .  #

3  1973   #          #      . . .  #

. . .

997  2020    #          #      . . .  #

998  2021    #          #      . . .  #

999   2022    #          #      . . .  #

The year columns span from 1970-2022, and repeats after running through the whole iteration. So there are several 1970 rows, 1971 rows, 2022 rows, ect. But there is dropped missing data, so the pattern doesn't perfectly repeat.

What I am trying to do is to merge all duplicate years rows, and average all their data points (stat1, stat2 . . . statn). So the new modified DataFrame only has 52 rows (1970 - 2022), with all their data points having been averaged.

scotscotmcc
  • 2,719
  • 1
  • 6
  • 29

2 Answers2

1

You can use the groupby() function in pandas along with the mean() function to achieve this. Following is a sample code:

import pandas as pd

# Create a sample DataFrame
data = {'year': [1970, 1970, 1971, 1971, 1972, 1972, 1973, 1973],
        'stat1': [1, 2, 3, 4, 5, 6, 7, 8],
        'stat2': [8, 7, 6, 5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Group by 'year' and compute the mean of each group
result = df.groupby('year').mean().reset_index()

print(result)

The groupby() function groups the data by the 'year' column, and the mean() function computes the mean (average) for each group. The reset_index() function is used to reset the indices of the resulting DataFrame.

Output of the above sample code:

   year  stat1  stat2
0  1970    1.5    7.5
1  1971    3.5    5.5
2  1972    5.5    3.5
3  1973    7.5    1.5

Apply the same logic to your DataFrame, and you should get the required output with one row per year and averaged values for each column.

Bilesh Ganguly
  • 3,792
  • 3
  • 36
  • 58
0

Would be easier to test on our end reproducible, usable example, so not positive here with your data, but you should be able to get this with a df.groupby()

Something like...

grp = df.groupby('year').mean()

If you have columns other than year and your measures, you may need to fiddle around to either include or exclude those as appropriate.

scotscotmcc
  • 2,719
  • 1
  • 6
  • 29
  • glad it worked. if you mark the answer as 'approved' or 'accepted' or whatever it is / update, others can immediately see the question is resolved and this worked for you. that said, the answer by Bilesh Ganguly here is a bit more robust and explains more as well. – scotscotmcc Apr 14 '23 at 19:48