
I have a large CSV file, and I am trying to find the mean and median of one column for each group in another column. One of my columns is titled 'Race' and another is called 'debt_to_income_ratio'. The 'Race' column takes one of four values: 'White', 'Black', 'Hispanic', or 'Other'. The 'debt_to_income_ratio' column holds a number giving the debt-to-income ratio for that row. I am trying to get the median and mean debt-to-income ratio for each race (White, Black, Hispanic, and Other).

The code I am currently using is:

df['race average'] = df.groupby('Race')['debt_to_income_ratio'].transform('mean') %>%
df['race median'] = df.groupby('Race')['debt_to_income_ratio'].transform('median')

I'm not really sure what I should be doing, so thanks in advance for any help!

Lauren
  • Is this Python or R? It seems like a chimera... Can you clarify which programming language this is intended for, and can you also share df by running dput(head(df)) and pasting the output? – StupidWolf Jun 28 '20 at 17:59
  • If this is a question about computing summary statistics by group of one variable, then it is a frequent duplicate. See [1](https://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group), [2](https://stackoverflow.com/questions/6053620/calculate-group-mean-or-other-summary-stats-and-assign-to-original-data). – Rui Barradas Jun 28 '20 at 18:03
  • This is intended for R. – Lauren Jun 28 '20 at 18:10
  • I used the code suggested in 2, which was: group_by(Race) %>% mutate(Race.mean.values = mean(debt_to_income_ratio)) . A new column was created, but all of the values were NA. – Lauren Jun 28 '20 at 18:17
  • We don't have your data, and your only code is both in python (not R) and not completely correct python code (`%>%`?). Please spend a moment to improve this question to be a minimal reprex, where we have some representative data to play with. (Unambiguous data is best served with `dput(head(x))` or `data.frame(...)`, depending on several factors.) From there, if you have preferences for R "ecosystems" like base, `dplyr`, or `data.table`, please be explicit, otherwise answers might encourage packages with which you are not familiar. – r2evans Jun 28 '20 at 18:39

2 Answers


We can use dplyr to do this. `mutate` adds the group mean and median as new columns on every row:

library(dplyr)
df %>%
    group_by(Race) %>%
    mutate(Mean = mean(debt_to_income_ratio, na.rm = TRUE),
           Median = median(debt_to_income_ratio, na.rm = TRUE))
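
If one row per race is wanted instead of new columns on every row, `summarise` can replace `mutate`. The sketch below is only an illustration under stated assumptions: the toy `df` is made up (the real data comes from the asker's CSV), and the names `with_na` and `by_race` are hypothetical. It also shows why `na.rm = TRUE` matters when a group contains a missing value.

```r
library(dplyr)

# Hypothetical toy data standing in for the real CSV.
df <- data.frame(
  Race = c("White", "White", "Black", "Black", "Hispanic", "Other"),
  debt_to_income_ratio = c(30, NA, 40, 20, 35, 25)
)

# Without na.rm = TRUE, a group that contains an NA gets an NA mean.
with_na <- df %>%
  group_by(Race) %>%
  summarise(Mean = mean(debt_to_income_ratio))

# With na.rm = TRUE, missing values are dropped before computing.
by_race <- df %>%
  group_by(Race) %>%
  summarise(
    Mean   = mean(debt_to_income_ratio, na.rm = TRUE),
    Median = median(debt_to_income_ratio, na.rm = TRUE)
  )
```

Here `by_race` has one row per value of `Race`, whereas the `mutate` version keeps every original row.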
   
akrun
  • I tried this, and it came up with a column titled mean, but all of the values are NA. Do you have any ideas what would be causing this? – Lauren Jun 28 '20 at 19:37
  • @Lauren if there is any NA element, it returns NA, so you have to use `na.rm = TRUE`. By default, it is `na.rm = FALSE` in `mean` and `median` – akrun Jun 28 '20 at 19:38

An option based on the base R aggregate function. Is this what you mean?

# Median and mean of debt_to_income_ratio for each Race
race_median <- aggregate(debt_to_income_ratio ~ Race, data = df, FUN = median)
race_mean   <- aggregate(debt_to_income_ratio ~ Race, data = df, FUN = mean)
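
As a rough sketch of the same idea, both statistics can also be computed in a single `aggregate` call by having `FUN` return a named vector. The toy `df` and the name `race_stats` are assumptions made for illustration only. Note that the formula interface of `aggregate` drops rows with missing values by default (`na.action = na.omit`).

```r
# Hypothetical toy data standing in for the real CSV.
df <- data.frame(
  Race = c("White", "White", "Black", "Black", "Hispanic", "Other"),
  debt_to_income_ratio = c(30, NA, 40, 20, 35, 25)
)

# FUN returns a named vector, so the result holds a matrix column
# with one "mean" and one "median" entry per race.
race_stats <- aggregate(
  debt_to_income_ratio ~ Race,
  data = df,
  FUN = function(x) c(mean = mean(x), median = median(x))
)

# Pull out the matrix of statistics (rows follow race_stats$Race).
stats <- race_stats$debt_to_income_ratio
```

The individual values can then be read with, for example, `stats[race_stats$Race == "Black", "mean"]`.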
Jelmer
  • I tried that, but it says: Error in eval(predvars, data, env) : object 'debt_to_income_ratio' not found – Lauren Jun 28 '20 at 18:26
  • If you look at `?aggregate`, you'll see that the `data` argument is missing. It should be `aggregate(formula, data, FUN)` – Daniel O Jun 28 '20 at 18:44