How can I find the average age for each row in this example. Each columns has its own age ranging from 0 to 90. I have read the csv with pandas.
Asked
Active
Viewed 284 times
0
-
Just lookup 'average' in pandas documentation, it's a simple function. – purpin Oct 07 '22 at 10:52
-
Does this answer your question? [pandas get column average/mean](https://stackoverflow.com/questions/31037298/pandas-get-column-average-mean) – ErnestBidouille Oct 07 '22 at 10:54
-
I does not work, ideally. The problem is that it only takes the average of the sum of each row, but it does not take into account how many occourances of each age. For example If there is 4 people that are aged 4. I need to mulitply 4 with 4 – Mads Oct 07 '22 at 10:55
-
Does this work? Create new column with averages of each row using `df['mean'] = df.mean(axis=1)` – Redox Oct 07 '22 at 11:04
-
No it does not work, i need to first sum up how many people has a certain age and then multiply it by the acctual age – Mads Oct 07 '22 at 11:06
2 Answers
0
I Assume all the age-related columns are named like Age_0, Age_1, Age_2, ...
import pandas as pd
import numpy as np
df = pd.read_csv("./data.csv")
ages = np.array([int(c.split("_")[-1]) for c in df.filter(regex="^Age_\d+$").columns])
df.loc[:, "avg_age"] = (df.filter(regex="^Age_\d+$").values * ages).sum(axis=1) / df.filter(regex="^Age_\d+$").values.sum(axis=1)

Mohammad
- 39
- 3
-
thank you, but i get this error message RuntimeWarning: invalid value encountered in true_divide copy.loc[:, "avg_age"] = (copy.filter(regex="^Age_\d+$").values * ages).sum(axis=1) / copy.filter(regex="^Age_\d+$").values.sum(axis=1) – Mads Oct 07 '22 at 11:24
-
it's a warning, not an error. the reason is that you have (at least) a row in which the sum of the number of observations is zero. For that row, the resulting `avg_age` would be NaN which is alright. if you want to get problematic row, try: df.loc[df.filter(regex="^Age_\d+$").sum(axis="columns") == 0, :] – Mohammad Oct 07 '22 at 11:29
-
-
you don't need to replace the observations with anything. In the average column, you will have some NaNs which is ok. – Mohammad Oct 07 '22 at 11:51
0
Assuming the values are the amount of people per age, here is one solution.
(Potentially not the fastest for large amount of data)
highest_age = 90
def row_age(x):
sum_age = 0
sum_people = 0
for i in range(0, highest_age + 1):
people = x[f"age_{i}"]
sum_people += people
sum_age += people * i
if sum_people > 0:
return sum_age / sum_people
df["average_age"] = df.apply(row_age, axis=1)

Felix
- 167
- 1
- 8