How do I find median and mean of certain values in a column?

Question

I have a large CSV file, and I am trying to find the median and the mean of a column for certain groups of rows. One of my columns is titled 'Race' and another is called 'debt_to_income_ratio'. Within the Race column, the four options are 'White', 'Black', 'Hispanic', and 'Other'. The 'debt_to_income_ratio' column holds a number giving the debt-to-income ratio for that row. I am trying to get a median and a mean debt-to-income ratio for each race (White, Black, Hispanic, and Other).

The code I am currently using is:

df['race average'] = df.groupby('Race')['debt_to_income_ratio'].transform('mean') %>%
df['race median'] = df.groupby('Race')['debt_to_income_ratio'].transform('median')

I'm not really sure what I should be doing, so thanks in advance for any help!

Is this Python or R? It seems like a chimera... Can you clarify which programming language this is intended for, and can you also share df by running dput(head(df)) and pasting the output? – StupidWolf, Jun 28 '20 at 17:59
If this is a question about computing summary statistics by group of one variable, then it is a frequent duplicate. See [1](https://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group), [2](https://stackoverflow.com/questions/6053620/calculate-group-mean-or-other-summary-stats-and-assign-to-original-data). — Rui Barradas, Jun 28 '20 at 18:03
I used the code suggested in 2, which was: group_by(Race) %>% mutate(Race.mean.values = mean(debt_to_income_ratio)). A new column was created, but all of the values were NA. – Lauren, Jun 28 '20 at 18:17
We don't have your data, and your only code is both in python (not R) and not completely correct python code (`%>%`?). Please spend a moment to improve this question to be a minimal reprex, where we have some representative data to play with. (Unambiguous data is best served with `dput(head(x))` or `data.frame(...)`, depending on several factors.) From there, if you have preferences for R "ecosystems" like base, `dplyr`, or `data.table`, please be explicit, otherwise answers might encourage packages with which you are not familiar. — r2evans, Jun 28 '20 at 18:39

akrun · Accepted Answer · 2020-06-28T19:37:55.763

1

We can use dplyr to do this:

library(dplyr)
df %>%
    group_by(Race) %>%
    mutate(Mean = mean(debt_to_income_ratio, na.rm = TRUE),
           Median = median(debt_to_income_ratio, na.rm = TRUE))
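If one row per race is wanted rather than the group statistics repeated on every row, `summarise` can be used in place of `mutate`. The following is only a sketch: the data frame below is a made-up stand-in, since the question does not show the real data.

```r
library(dplyr)

# Hypothetical toy data; the real df is not shown in the question
df <- data.frame(
  Race = c("White", "White", "Black", "Black", "Hispanic", "Other"),
  debt_to_income_ratio = c(30, 40, 25, NA, 35, 20)
)

# Collapse to one row per race, ignoring missing ratios
race_summary <- df %>%
  group_by(Race) %>%
  summarise(
    Mean = mean(debt_to_income_ratio, na.rm = TRUE),
    Median = median(debt_to_income_ratio, na.rm = TRUE)
  )

print(race_summary)
```

`mutate` keeps the original number of rows, which is useful when the group value is needed alongside each observation; `summarise` is usually the more readable choice for a small results table.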

edited Jun 28 '20 at 19:37

answered Jun 28 '20 at 18:55


I tried this, and it came up with a column titled mean, but all of the values are NA. Do you have any ideas what would be causing this? – Lauren Jun 28 '20 at 19:37
@Lauren if there is any NA element, it returns NA, so you have to use `na.rm = TRUE`. By default, it is `na.rm = FALSE` in `mean` and `median` – akrun Jun 28 '20 at 19:38
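The effect of `na.rm` can be seen on a small vector. This is a minimal sketch with made-up values, not the asker's data. If the result is still NA with `na.rm = TRUE`, one thing worth checking (an assumption, since the data is not shown) is whether the column was read in as character rather than numeric.

```r
# Hypothetical values with one missing entry
x <- c(30, NA, 40)

default_mean <- mean(x)                   # NA, because na.rm defaults to FALSE
clean_mean   <- mean(x, na.rm = TRUE)     # mean of 30 and 40
clean_median <- median(x, na.rm = TRUE)   # median of 30 and 40

# A column read in as text is not numeric, and mean() returns NA for it
column_is_numeric <- is.numeric(x)

print(c(default = default_mean, mean = clean_mean, median = clean_median))
```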

Jelmer · Answer 2 · 2020-06-29T06:45:18.243

0

An option based on the base R aggregate function. Is this what you mean?

race_median <- aggregate(debt_to_income_ratio ~ Race, data = df, FUN = function(x) quantile(x, 0.5, na.rm = TRUE))
race_mean   <- aggregate(debt_to_income_ratio ~ Race, data = df, FUN = mean)
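Both statistics can also be computed in a single `aggregate` call by returning a named vector from `FUN`. This is a sketch on a made-up data frame (the real data is not shown). Note that the formula interface drops rows with missing values by default, and that the result stores the two statistics as a matrix column.

```r
# Hypothetical toy data; the real df is not shown in the question
df <- data.frame(
  Race = c("White", "White", "Black", "Black", "Hispanic", "Other"),
  debt_to_income_ratio = c(30, 40, 25, NA, 35, 20)
)

# One call, two statistics per race
race_stats <- aggregate(
  debt_to_income_ratio ~ Race,
  data = df,
  FUN = function(x) c(mean = mean(x), median = median(x))
)

# The statistics live in a matrix column; pull one out by name
white_mean <- race_stats$debt_to_income_ratio[race_stats$Race == "White", "mean"]

print(race_stats)
```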

edited Jun 29 '20 at 06:45

answered Jun 28 '20 at 18:06


I tried that, but it says: Error in eval(predvars, data, env) : object 'debt_to_income_ratio' not found – Lauren Jun 28 '20 at 18:26
if you search `?aggregate` you'll see that the data argument is missing. it should be `aggregate(formula, data, FUN)` – Daniel O Jun 28 '20 at 18:44
