How to specify a column name in ddply via character variable?

Question

I have a tibble/dataframe with

sample_id     condition     state
---------------------------------
sample1       case          val1
sample1       case          val2
sample1       case          val3
sample2       control       val1
sample2       control       val2
sample2       control       val3

The dataframe is generated within a for loop for different states. Hence, every dataframe has a different name for the state column.

I want to group the data by sample_id and calculate the median of the state column such that every unique sample_id has a single median value. The output should be like below...

sample_id     condition     state
---------------------------------
sample1       case          median
sample2       control       median

I am trying the command below. It works if I give the column name directly, but I am not able to pass the name via the character variable state. I tried ensym(state) and !!ensym(state), but both throw errors.

ddply(dat_state, .(sample_id), summarize,  condition=unique(condition), state_exp=median(ensym(state)))

Just want to point out that `ddply` comes from `plyr`, which has been deprecated for a few years now. You can do this with `dplyr` functions: just group by the first two columns and summarize the third as the median. To do something like this with the column name as a variable, you can use tidyeval in a function. You're trying to do this with something like `x <- "state"`? — camille, Dec 30 '19 at 18:19
Does this answer your question? [standard evaluation in dplyr: summarise a variable given as a character string](https://stackoverflow.com/questions/26724124/standard-evaluation-in-dplyr-summarise-a-variable-given-as-a-character-string) — camille, Dec 30 '19 at 18:25
@camille `plyr` has not been deprecated. It is retired, meaning we continue to maintain it on CRAN indefinitely without adding new features. — Lionel Henry, Dec 31 '19 at 08:06
@LionelHenry true, I used the wrong term. But it's fair to encourage people to move toward the packages taking its place, right? — camille, Dec 31 '19 at 13:58
yup totally fair, especially for new packages and scripts. I just wanted to make sure people are not going to get the wrong impression that plyr might disappear from CRAN anytime soon. — Lionel Henry, Jan 02 '20 at 08:51

score 1 · Answer 1 · answered Dec 30 '19 at 21:38

As camille notes above, this is easier in dplyr. Basic syntax (not yet addressing your question):

my_df %>% 
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))

Note that this syntax gives you one value for every unique sample_id-condition pair. That isn't an issue in your example, since every sample_id has a single condition, but it is something to be aware of.

On to your question... It's not quite clear to me how you're planning to pass the state name to your calculation, but there are a couple of ways you can handle this. One is to use dplyr's "rename" function:

x <- "Massachusetts"
my_df %>% 
  rename(state = all_of(x)) %>%  # all_of() (dplyr >= 1.0) marks x as a character vector of column names
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))

The (probably more proper) way to do this is to write a function using dplyr's "tidyeval" syntax:

myfunc <- function(df, state_name) {
  df %>% 
    group_by(sample_id, condition) %>% 
    summarize(state = median({{state_name}}))
}

myfunc(my_df, Massachusetts) # Note: Unquoted state name
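
If the column name arrives as a character string rather than an unquoted name (as it does in the question's loop), another option is dplyr's `.data` pronoun, which can be subset with a string. A minimal sketch, assuming dplyr 1.0 or later; the toy data frame and the column name `STAT1` are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data; STAT1 stands in for one of the state columns.
my_df <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 6, 10, 20, 60)
)

x <- "STAT1"  # column name held in a character variable

res <- my_df %>%
  group_by(sample_id, condition) %>%
  # .data[[x]] looks up the column whose name is stored in x
  summarize(state = median(.data[[x]]), .groups = "drop")
```

This avoids renaming the column, so the same pipeline can be reused inside a loop over several column names.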

score 0 · Accepted Answer · answered Jan 03 '20 at 14:44

Thank you all for putting effort into answering my question. With your suggestions, I have found the solution. Below is the code to what I was trying to achieve by grouping sample_id and condition and passing state through a variable.

state_mark <- c("pPCLg2", "STAT1", "STAT5", "AKT")

for(state in state_mark){
    dat_state <- dat_clust_stim[,c("sample_id", "condition", state)]

    # I had to use !!ensym() to convert a character to a symbol.
    dat_med <- group_by(dat_state, sample_id, condition) %>% 
               summarise(med = median(!!ensym(state)))

    dat_med <- ungroup(dat_med)
    x <- dat_med[dat_med$condition == "case", "med"]
    y <- dat_med[dat_med$condition == "control", "med"]
    t_test <- t.test(x$med, y$med)
}
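
One thing to watch in a loop like this: `t_test` is overwritten on every iteration, so only the last marker's result survives. A possible variation, sketched under assumptions, stores each result in a named list. It also uses `sym()`, the rlang function intended for turning a string into a symbol (`ensym()` is meant for capturing function arguments). The data frame below is a made-up stand-in for `dat_clust_stim`, and `t_tests` is a hypothetical name:

```r
library(dplyr)
library(rlang)

set.seed(1)
# Hypothetical stand-in for dat_clust_stim: 3 case and 3 control samples
dat_clust_stim <- data.frame(
  sample_id = rep(paste0("sample", 1:6), each = 4),
  condition = rep(c("case", "control"), each = 12),
  STAT1     = rnorm(24),
  AKT       = rnorm(24)
)
state_mark <- c("STAT1", "AKT")

t_tests <- list()  # one t-test result per marker
for (state in state_mark) {
  dat_med <- dat_clust_stim %>%
    group_by(sample_id, condition) %>%
    # sym() converts the string in `state` to a symbol; !! injects it
    summarise(med = median(!!sym(state)), .groups = "drop")

  x <- dat_med$med[dat_med$condition == "case"]
  y <- dat_med$med[dat_med$condition == "control"]
  t_tests[[state]] <- t.test(x, y)
}
```

After the loop, `t_tests[["STAT1"]]` holds the test for that marker.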

score 0 · Answer 3 · answered Feb 03 '23 at 09:10

If you want to stay old-fashioned, you can use the eval(parse(text=expression)) idiom:

ddply(dat_state, .(sample_id), summarize, 
      state_exp = eval(parse(text = paste("median(",state,")"))))

No fancy operators but mind the parentheses!
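
If building code from strings feels fragile, `ddply` can also take an ordinary function of each group's data frame, inside which the column can be looked up by name with `[[`. A minimal sketch; the toy data and the column name `STAT1` are assumptions for illustration:

```r
library(plyr)

# Hypothetical toy data; STAT1 stands in for one of the state columns.
dat_state <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 6, 10, 20, 60)
)
state <- "STAT1"  # column name held in a character variable

# d is the subset of dat_state for one sample_id
res <- ddply(dat_state, "sample_id", function(d) {
  data.frame(
    condition = unique(d$condition),
    state_exp = median(d[[state]])  # d[[state]] extracts the column by name
  )
})
```

Here the grouping variable is given as a string too, which `ddply` accepts in place of `.(sample_id)`.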

Niels Holst
