Create a summary table for continuous variable by categorical variable

Question

I am a beginner in R, having transitioned from Stata/SPSS. I used to run the tabular command in Stata to generate a summary of a continuous variable by a grouping variable. Is there any way I can do this in R?

I searched on SO, and I found this thread: How to get Summary statistics by group

While Hadley's `map` function did help me get quartiles, mean, and median, I need more. Specifically, I need the number of elements in a particular quartile and the number of elements in a particular level of a factor.

Here's dummy code:

data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

df %>%
  data.table::as.data.table(.) %>%
  split(., by = c("group"), drop = TRUE, sorted = TRUE) %>%
  purrr::map(~summary(.$dt))

And

describe(df$group)

These give two disjoint sets of output: one only provides descriptive statistics about the categorical variable, while the other only provides the six basic summary statistics. I need to see what's going on within a quartile.

I am using `describe` from the Hmisc package above.

How can I do this using R? I'd sincerely appreciate any help.

Sample Output:

My sample output would look something like this, but it would be grouped for each of the four levels of the categorical variable. This way, I can analyze what's going on with the continuous variable for each level of the categorical variable. Right now, the output is spread across three different commands, and it is harder for me to understand what's happening.

Here are the commands:

 df %>% data.table::as.data.table(.) %>% split(.,by=c("group"),drop = TRUE,sorted = TRUE) %>% purrr::map(~summary(.$dt))
 df %>% data.table::as.data.table(.) %>% split(.,by=c("group"),drop = TRUE,sorted = TRUE) %>% purrr::map(~describe(.$dt))
 df %>% group_by(group) %>% count(quartile = ntile(dt, 4))

[The credit for the third command goes to one of the people who answered this question.]

`dplyr`'s functions are quite easy to follow: `group_by` the levels and then `summarise` – Mateusz1981, Jan 18 '17 at 07:14
`df %>% group_by(group) %>% count(quartile = ntile(dt, 4))`? What does your desired output look like? – alistaire, Jan 18 '17 at 07:25
@alistaire. Thank you so much for your help. This does help, but I am looking for all summary statistics for one level of a factor. I will add some commentary on sample output. – watchtower, Jan 18 '17 at 07:30
None of your code runs, and you haven't shown any particular results that you're looking for. `df %>% group_by(group) %>% group_by(quartile = ntile(dt, 4), add = TRUE) %>% do(broom::tidy(summary(.$dt)))`? I'm just guessing at this point, because you haven't shown what you want. – alistaire, Jan 18 '17 at 07:46
@alistaire. Thanks so much for your help. I am surprised because my code runs fine on my machine. I double-checked it again. Not sure why it is not working on your machine. I am sorry about this. However, your output is spot on and 95% matches what I am looking for. Is there any way we can put two functions, `summary` and `describe`, in `map`? I would just want to add the output of `describe()` to each of your rows. Please let me know if you need any info from my side to investigate why the above code isn't working on your machine. – watchtower, Jan 18 '17 at 07:56
Nm, got it running, but you really need to specify what packages you're using. Assuming that's `Hmisc::describe`, it returns a custom `describe` class that's not easily coerced to a data.frame. It's easier to reconstruct the parts directly within `summarise`. – alistaire, Jan 18 '17 at 08:02
@alistaire - very respectfully, do you mind explaining your last comment a bit: "It's easier to reconstruct the parts directly within `summarise`"? I didn't follow at all. I am looking to add the output of `Hmisc::describe()` to your output pasted in the comment window. I think that should be it. – watchtower, Jan 18 '17 at 08:11
You can calculate everything `describe` does manually, and arrange them more naturally in your data.frame. `n` is just `n()`; `missing` is `sum(is.na(...))`; `distinct` is `n_distinct`, etc. You can reconstruct the table with `table` and `prop.table`, though fitting it into a data.frame may require some creativity. More generally, think about what you need, and calculate it. – alistaire, Jan 18 '17 at 08:19
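To make the last comment concrete, here is a minimal sketch of rebuilding the `describe`-style pieces by hand. It assumes dplyr is installed and reuses the question's dummy data; the object names `describe_by_group` and `freq_by_group` are made up for illustration and do not come from the thread. The frequency table is kept as a separate list because, as the comment notes, it does not fit neatly into one row per group.

```r
library(dplyr)

# Dummy data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

# One row per group with the counts that the comment maps to describe():
# n -> n(), missing -> sum(is.na(...)), distinct -> n_distinct()
describe_by_group <- df %>%
  group_by(group) %>%
  summarise(
    n        = n(),              # rows in the group (includes any NA rows)
    missing  = sum(is.na(dt)),   # number of missing values
    distinct = n_distinct(dt),   # number of distinct values
    mean     = mean(dt, na.rm = TRUE)
  )

# Per-group proportion table, built with table() and prop.table()
freq_by_group <- lapply(split(df$dt, df$group),
                        function(x) prop.table(table(x)))

print(describe_by_group)
print(freq_by_group$C)
```

Keeping the scalar statistics in a data frame and the tables in a list is only one possible layout; reshaping the tables into long format would be another.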

score 0 · Answer 1 · answered Jan 18 '17 at 07:17


# Dummy data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- rep(LETTERS[1:4], c(4, 6, 6, 8))
df <- data.frame(group = grp, dt = data)

library(dplyr)

# Mean of dt within each level of group
df %>% group_by(group) %>% summarise(mdt = mean(dt, na.rm = TRUE))
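The same `group_by` plus `summarise` pattern can return several statistics at once. The sketch below is an illustration rather than part of the original answer: it assumes dplyr is installed, and the column names (`q1`, `q3`, and so on) and the object name `summary_by_group` are arbitrary. It reproduces the six numbers that `summary()` reports, plus a count, as one row per group.

```r
library(dplyr)

# Dummy data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- rep(LETTERS[1:4], c(4, 6, 6, 8))
df <- data.frame(group = grp, dt = data)

# One row per group: count plus min, quartiles, median, mean, and max
summary_by_group <- df %>%
  group_by(group) %>%
  summarise(
    n      = n(),
    min    = min(dt, na.rm = TRUE),
    q1     = quantile(dt, probs = 0.25, na.rm = TRUE, names = FALSE),
    median = median(dt, na.rm = TRUE),
    mean   = mean(dt, na.rm = TRUE),
    q3     = quantile(dt, probs = 0.75, na.rm = TRUE, names = FALSE),
    max    = max(dt, na.rm = TRUE)
  )

print(summary_by_group)
```

The quartiles here use the default interpolation of `quantile()`, so they can differ slightly from what other packages report.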

answered Jan 18 '17 at 07:17

Mateusz1981


Thanks, but how would I know the count of elements in a quartile? – watchtower Jan 18 '17 at 07:17
quantile for the group? `by(df, df$group, summary)` – Mateusz1981 Jan 18 '17 at 07:20
@Mateusz--Sorry if I wasn't clear. I meant the number and range of continuous elements in a quartile for a given group. Essentially, I want to compute the number and range of elements in a continuous variable, and then do this recursively for each group. Does that make sense? – watchtower Jan 18 '17 at 07:23
Probably it does but I am too short to answer :) or check that `df %>% group_by(group) %>% summarise(mdt = quantile(dt, probs = 0.25, na.rm = T), li = n())`; you can set the quantile you want with `probs`, and `n()` counts the observations – Mateusz1981 Jan 18 '17 at 07:24
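For the "number and range of elements in a quartile for a given group" request, one possible sketch combines the `ntile` idea from the question's comments with `summarise`. It assumes dplyr 1.0 or later (for the `.groups` argument); the names `quartile_table`, `low`, and `high` are invented for this example. Note that `ntile` splits each group into four rank-based buckets of near-equal size, which is not the same as cutting at the values returned by `quantile()`.

```r
library(dplyr)

# Dummy data from the question
data <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
          71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
grp <- factor(rep(LETTERS[1:4], c(4, 6, 6, 8)))
df <- data.frame(group = grp, dt = data)

quartile_table <- df %>%
  group_by(group) %>%
  mutate(quartile = ntile(dt, 4)) %>%   # bucket 1..4 within each group
  group_by(group, quartile) %>%
  summarise(
    n    = n(),       # how many values fall in this bucket
    low  = min(dt),   # smallest value in the bucket
    high = max(dt),   # largest value in the bucket
    .groups = "drop"
  )

print(quartile_table)
```

Each row is one quartile bucket of one group, so the count and range sit side by side instead of being spread across separate commands.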
