
I have a tibble/dataframe with

sample_id     condition     state
---------------------------------
sample1       case          val1
sample1       case          val2
sample1       case          val3
sample2       control       val1
sample2       control       val2
sample2       control       val3

The dataframe is generated within a for loop for different states. Hence, every dataframe has a different name for the state column.

I want to group the data by sample_id and calculate the median of the state column such that every unique sample_id has a single median value. The output should be like below...

sample_id     condition     state
---------------------------------
sample1       case          median
sample2       control       median

I am trying the command below. It works if I give the column name directly, but I am not able to pass the name via the character variable state. I tried ensym(state) and !!ensym(state), but both throw errors.

ddply(dat_state, .(sample_id), summarize,  condition=unique(condition), state_exp=median(ensym(state)))
camille
Rohit Farmer
  • Just want to point out that `ddply` comes from `plyr`, which has been deprecated for a few years now. You can do this with `dplyr` functions: just group by the first two columns and summarize the third as the median. To do something like this with the column name as a variable, you can use tidyeval in a function. You're trying to do this with something like `x <- "state"`? – camille Dec 30 '19 at 18:19
  • Does this answer your question? [standard evaluation in dplyr: summarise a variable given as a character string](https://stackoverflow.com/questions/26724124/standard-evaluation-in-dplyr-summarise-a-variable-given-as-a-character-string) – camille Dec 30 '19 at 18:25
  • @camille `plyr` has not been deprecated. It is retired, meaning we continue to maintain it on CRAN indefinitely without adding new features. – Lionel Henry Dec 31 '19 at 08:06
  • @LionelHenry true, I used the wrong term. But it's fair to encourage people to move toward the packages taking its place, right? – camille Dec 31 '19 at 13:58
  • Yup, totally fair, especially for new packages and scripts. I just wanted to make sure people are not going to get the wrong impression that plyr might disappear from CRAN anytime soon. – Lionel Henry Jan 02 '20 at 08:51

3 Answers


As camille notes above, this is easier in dplyr. Basic syntax (not yet addressing your question):

my_df %>% 
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))

Note that this syntax gives you a value for every unique sample_id-condition pair. That isn't an issue in your example, since every sample_id has a single condition, but it's something to be aware of.

On to your question... It's not quite clear to me how you're planning to pass the state name to your calculation, but there are a couple of ways you can handle this. One is to use dplyr's "rename" function:

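x <- "Massachusetts"  # name of the state column, held in a character variable
my_df %>% 
  rename(state = all_of(x)) %>%  # all_of() makes the lookup by string explicit
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))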

The (probably more proper) way to do this is to write a function using dplyr's "tidyeval" syntax:

myfunc <- function(df, state_name) {
  df %>% 
    group_by(sample_id, condition) %>% 
    summarize(state = median({{state_name}}))
}

myfunc(my_df, Massachusetts) # Note: Unquoted state name
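
If the column name arrives as a character string (as in the question's loop) rather than as an unquoted name, one option is the `.data` pronoun. This is only a minimal sketch: the toy data, the column name "STAT1", and the helper name `median_by_sample` are all made up for illustration, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data; the "STAT1" column and its values are invented for this sketch.
my_df <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 3, 10, 20, 60)
)

# Hypothetical helper: state_name is a character string naming the column.
median_by_sample <- function(df, state_name) {
  df %>%
    group_by(sample_id, condition) %>%
    # .data[[...]] looks the column up by its string name
    summarize(state = median(.data[[state_name]]), .groups = "drop")
}

res <- median_by_sample(my_df, "STAT1")  # Note: quoted state name
```

The difference from `{{ }}` is only in how the caller supplies the column: `{{ }}` expects a bare name, `.data[[ ]]` expects a string.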
benc

Thank you all for putting effort into answering my question. With your suggestions, I have found the solution. Below is the code for what I was trying to achieve: grouping by sample_id and condition and passing state through a variable.

state_mark <- c("pPCLg2", "STAT1", "STAT5", "AKT")

for(state in state_mark){
    dat_state <- dat_clust_stim[,c("sample_id", "condition", state)]

    # I had to use !!ensym() to convert a character to a symbol.
    dat_med <- group_by(dat_state, sample_id, condition) %>% 
               summarise(med = median(!!ensym(state)))

    dat_med <- ungroup(dat_med)
    x <- dat_med[dat_med$condition == "case", "med"]
    y <- dat_med[dat_med$condition == "control", "med"]
    t_test <- t.test(x$med, y$med)
}
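
For comparison, here is a sketch of the same loop written with the `.data` pronoun instead of `!!ensym()`, and keeping every test result instead of overwriting `t_test` on each pass. The toy data below only stands in for `dat_clust_stim`; its values, the two marker names used, and the `t_tests` list are assumptions for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for dat_clust_stim; all values are invented.
dat_clust_stim <- data.frame(
  sample_id = rep(paste0("sample", 1:4), each = 3),
  condition = rep(c("case", "case", "control", "control"), each = 3),
  STAT1     = c(1, 2, 3,  2, 4, 9,  10, 20, 60,  5, 30, 40),
  AKT       = c(3, 5, 7,  1, 6, 8,  2, 4, 6,  9, 11, 13)
)

state_mark <- c("STAT1", "AKT")
t_tests <- list()  # one t-test result per marker

for (state in state_mark) {
  # Median of the current marker per sample; .data[[state]] takes the string.
  dat_med <- dat_clust_stim %>%
    group_by(sample_id, condition) %>%
    summarise(med = median(.data[[state]]), .groups = "drop")

  x <- dat_med$med[dat_med$condition == "case"]
  y <- dat_med$med[dat_med$condition == "control"]
  t_tests[[state]] <- t.test(x, y)
}
```

Afterwards `t_tests[["STAT1"]]` holds the test for that marker. One caveat: if the data frame itself had a column literally named `state`, the loop variable should be renamed to avoid ambiguity.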
Rohit Farmer

If you want to stay old-fashioned, you can use the eval(parse(text=expression)) idiom:

ddply(dat_state, .(sample_id), summarize, 
      state_exp = eval(parse(text = paste("median(",state,")"))))

No fancy operators but mind the parentheses!
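
Another way to stay old-fashioned without building code from strings is to index the data frame by the character name in base R. This is only a sketch: the toy data and the "STAT1" column are invented, and it uses `aggregate()` instead of `ddply()`.

```r
# Toy data; the "STAT1" column and its values are invented for this sketch.
dat_state <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 3, 10, 20, 60)
)
state <- "STAT1"  # column name held in a character variable

# dat_state[state] selects the column by its string name; no parsing needed.
state_med <- aggregate(
  dat_state[state],
  by  = dat_state[c("sample_id", "condition")],
  FUN = median
)

# Rename the summarized column to match the question's output name.
names(state_med)[names(state_med) == state] <- "state_exp"
```

Because the column is selected with `[`, there is no pasted expression and so no parentheses to mind.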

Niels Holst