4

I would like to reference a column inside the summarise() in dplyr with its index rather than with its name. For example:

        > a

           id visit timepoint bedroom  den
            1   0     0        62      NA 
            2   1     0        53    6.00  
            3   2     0        56    2.75   
            4   0     1        55      NA 
            5   1     2        61      NA 
            6   2     0        54      NA 
            7   0     1        58    2.75   
            8   1     2        59      NA 
            9   2     2        60      NA 
            10  0     1        57      NA 

           # E.g. 
           a %>% group_by(visit) %>% summarise(avg.bedroom = mean(bedroom, na.rm   =T)
           # Returns
        visit avg.dedroom
        <dbl>       <dbl>
     1     0       4.375
     2     1       2.750
     3     2         NaN

How could I use the index of column "bedroom" rather its name in the summarise clause? I tried:

     a %>% group_by(visit) %>% summarise("4" = mean(.[[4]], na.rm = T))

but this returned false results:

       visit      `4`
        <dbl>    <dbl>
      1     0 3.833333
      2     1 3.833333
      3     2 3.833333

Is my objective achievable and if yes how? Thank you.

pv7
  • 95
  • 3
  • 7
  • 1
    You may find this relevant http://stackoverflow.com/questions/32618744/dplyr-how-to-reference-columns-by-column-index-rather-than-column-name-using-mu – MFR Nov 07 '16 at 23:32
  • 1
    In the answer @MFR linked a commenter states that the `.[[4]]` syntax does not work with `group_by`. This sadly only explains why it does not work :). – Paul Hiemstra Nov 08 '16 at 07:51

2 Answers2

1

Perhaps not exactly what you're looking for, but one option would be to use purrr rather than dplyr. Something like

# Read in data
d <- read.table(textConnection(" id visit timepoint bedroom  den
        1  12     0        62      NA 
        2  14     0        53    6.00  
        3  14     0        56    2.75   
        4  14     1        55      NA 
        5  14     2        61      NA 
        6  15     0        54      NA 
        7  15     1        58    2.75   
        8  16     2        59      NA 
        9  16     2        60      NA 
        10 17     1        57      NA "), 
    header = TRUE)


library(purrr)

d %>% 
    split(.$timepoint) %>% 
    map_dbl(function(x) mean(x[ ,5], na.rm = TRUE))

#     0     1     2 
# 4.375 2.750   NaN 

Or, with base

aggregate(d[ ,5] ~ timepoint, data = d, mean)

#   timepoint d[, 5]
# 1         0  4.375
# 2         1  2.750
Daniel Anderson
  • 2,394
  • 13
  • 26
  • Your code works but gives aggregate results. I would like to group the results by "visit" and show the "visit" column on the left. – pv7 Nov 08 '16 at 08:33
  • So, given that desired output, doesn't aggregate work perfectly? i.e. `aggregate(hp ~ cyl, mtcars, mean)`. Another option would be to add `%>% as.data.frame()` to the code above, which would list the grouping variable as the row names. – Daniel Anderson Nov 08 '16 at 16:37
  • I see your point and it is valid. However, I think that my situation is a little bit more complex than the example you are using. In my case there are two categorical variables. First, the values in the one categorical variable must be spread horizontally and second the values should be grouped by a second categorical variable. While in your example there is only one categorical variable at play, the cyl. – pv7 Nov 08 '16 at 20:00
  • Check out the edit. I read in your data and did the same thing. Does that work? – Daniel Anderson Nov 08 '16 at 20:26
0

The answer I found is the summarize_at() function of dplyr. Here is how I used summarize_at() to create summary statistics on subsets of my dataframe where the columns were not known in advance (object is my original dataframe which is in a long form and has a column -- room -- that contains the names of the rooms, as well as two other columns, "visit" and "value"):

          # Convert object to a wide form

          object$row <- 1 : nrow(object)

          y <- spread(object, room, value)


          # Remove the row column from y

          y <- y %>% select(-row)

          # Initialize stat1, the dataframe with the summary
          # statistics

          stat1 <- data.frame(visit = c(0, 1, 2))

          # Find the number of columns that stat1 will eventually
          # have

          y <- y %>% filter(id == id) %>% 
              select_if(function(col) mean(is.na(col)) != 1) 

          n <- ncol(y)

          # Append columns with summary statistics to stat1

          for (i in 3 : n) {
              t <- y %>% group_by(visit) %>% 
                  summarise_at(c(i), mean, na.rm = T)

              t[, 2] <- round(t[, 2], 2)

              stat1 <- cbind(stat1, t[, 2])
          }

          # Pass the dataframe stat1 to the list "results"

          results$stat1 <- stat1
pv7
  • 95
  • 3
  • 7