Grouping and summarising when the columns returned are not known in advance

Question

I have a dataframe --say x -- that feeds a function which returns a subset depending on the value of a column x$id.

This subset y includes a column y$room that contains a different mix of values depending on the x$id value.

The subset is then spread with tidyr and the values of the y$room become columns.
Then the resulting extended df --say ext_y-- must be grouped by a column y_ext$visit and summary statistics should be calculated for the remaining columns by a special function.

The obvious problem is that these columns are not known in advance and therefore can not be defined by their names within the function.

The alternative of using the indexes of the columns instead of the names does not seem to work with dplyr, when group_by is involved.

Do you have any ideas how this problem could be tackled?

The dataframe has many thousands rows, so I will give you only a glimpse:

       > tail(y)
           id visit        room value
     11940 14     2 living room    19
     11941 14     2 living room    16
     11942 14     2 living room    15
     11943 14     2 living room    22
     11944 14     2 living room    25
     11945 14     2 living room    20

     > unique(x$id)
    [1]  14  20  41  44  46  54  64  74 104 106
     > unique(x$visit)
    [1] 0 1 2
     > unique(x$room)
     [1] "bedroom"      "living room"  "family  room" "study room"   "den"         
     [6] "tv room"      "office"       "hall"         "kitchen"      "dining room" 
     > summary(x$value)
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
        2.000    2.750    7.875   17.410   16.000 1775.000

For a given id the spread() of tidyr returns only a subset of the room values in x. E.g. for id = 54:

  > y<- out
  > y$row <- 1 : nrow(y)
  > y_ext <- spread(y, room, value)
  > head(y_ext)
       id visit row bedroom family  room living room
     1 14     0   1    6.00           NA          NA
     2 14     0   2    6.00           NA          NA
     3 14     0   3    2.75           NA          NA
     4 14     0   4    2.75           NA          NA
     5 14     0   5    2.75           NA          NA
     6 14     0   6    2.75           NA          NA

Now, I must compose a function that groups the result by visit and summarises the columns that are returned for each group in the following form:

         visit    bedroom    family room   living room
      1   0         NA            2.79         3.25
      2   1         NA             NA          4.53
      3   2         4.19           3.77        NA

As I mentioned above, I do not know in advance which columns will be returned for a given id and this complicates the problem. Of course a short cut would be to check and find out for each id which columns are returned and then create an if structure that directs each id to the appropriate code, but this is not very elegant, I am afraid.

Hope this helped to give you a better picture.

Welcome to Stack Overflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. — zx8754, Nov 08 '16 at 13:02
instead of telling us the problem, it's customary to give an example dataset... that helps solve problems. In the absence of that... are there always the same number of columns selected? are there any patterns in the names that could be used for example to select by features? Are there scenarios that are common? Modify your problem with all of the above and we'll see what we can do. — Amit Kohli, Nov 08 '16 at 13:07
Thank you for the feedback. I have added code and data. I hope this will help you comprehend better the problem. Please let me know if I can improve the question further. Moreover, if possible improve the grading of the question so that I will not face a problem in putting new questions in the future. — pv7, Nov 08 '16 at 16:47
This looks like a really interesting potential problem. However, there is still not enough information to build anything. What things change? What is actually present in the dataset? What do you need the output to contain? Take a look [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info on making a reproducible example in R. — Mark Peterson, Nov 08 '16 at 19:40

Mark Peterson · Accepted Answer · 2016-11-10T12:41:13.993

Alright, this was interesting enough to me that I made some sample data myself:

nSamples <- 50

allRooms <-
  c("Living", "Dining", "Bedroom", "Master", "Family", "Garage", "Office")

set.seed(8675309)

df <-
  data_frame(
    id = sample(1:5, nSamples, TRUE)
    , visit = sample(1:3, nSamples, TRUE)
    , room = sample(allRooms, nSamples, TRUE)
    , value = round(rnorm(nSamples, 20, 5))
  )

The way I see it, there are three approaches, in ascending order of reasonabality. First option, is to follow your basic layout. Here, I am splitting the df by the id, spreading as instructed, then using summarise_all to do the summation, removing the need to identify room names explicitly.

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      mutate(row = 1:n()) %>%
      spread(room, value) %>%
      select(-row) %>%
      group_by(visit) %>%
      summarise_all(sum, na.rm = TRUE)
  })

This returns the following (note unique columns):

$`1`
# A tibble: 3 × 6
  visit Bedroom Dining Garage Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0     27     27      0      0
2     2      22     19      0     20     23
3     3       0      0      0     27      0

$`2`
# A tibble: 3 × 6
  visit Bedroom Dining Family Living Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      15      0      0      0     17
2     2       0     14     42     30      0
3     3      15     13     18      0     20

$`3`
# A tibble: 3 × 6
  visit Bedroom Dining Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      24      0     36      0     28
2     2       0      0     15     30      0
3     3       0     25     21      0     15

$`4`
# A tibble: 3 × 7
  visit Bedroom Dining Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1       0      0     23     20      0     24
2     2       0     28     22      0      0      0
3     3      24      0     36      0     16      0

$`5`
# A tibble: 3 × 8
  visit Bedroom Dining Family Garage Living Master Office
  <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1      23      0      0     21      0     16      0
2     2      44     14     41      0     26      0     18
3     3      21     19      0      0     25     19      0

However, because you had to add the row in to get a spread to work (without it, there are not unique entries), the spread actually doesn't help. You can get the same thing a lot more easily if you do the summarising first, like so:

df %>%
  split(.$id) %>%
  lapply(function(x){
    x %>%
      select(-id) %>%
      group_by(visit, room) %>%
      summarise(Sum = sum(value)) %>%
      spread(room, Sum, 0)
  })

Note that it gives 0 for rooms with no visits because of that last 0 for the fill argument. If you would rather that returns NA, you can leave the default.

Finally, it is unclear why you would want to do this separately in the first place. It may make far more sense to just do this all in one big group_by and handle the missings as needed after the fact. To wit, here is a lot less code to get the same summaries.

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum)

gives

      id visit Bedroom Dining Family Garage Living Master Office
*  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1      1     1      NA     27     NA     27     NA     NA     NA
2      1     2      22     19     NA     NA     NA     20     23
3      1     3      NA     NA     NA     NA     NA     27     NA
4      2     1      15     NA     NA     NA     NA     NA     17
5      2     2      NA     14     42     NA     30     NA     NA
6      2     3      15     13     18     NA     NA     NA     20
7      3     1      24     NA     NA     NA     36     NA     28
8      3     2      NA     NA     NA     NA     15     30     NA
9      3     3      NA     25     NA     NA     21     NA     15
10     4     1      NA     NA     NA     23     20     NA     24
11     4     2      NA     28     NA     22     NA     NA     NA
12     4     3      24     NA     NA     36     NA     16     NA
13     5     1      23     NA     NA     21     NA     16     NA
14     5     2      44     14     41     NA     26     NA     18
15     5     3      21     19     NA     NA     25     19     NA

If you want to filter down to just one id, use filter after the fact, then remove columns with all NA entries. (Note, you would likely save the output once, then pass it through the last two lines once for each id of interest, e.g., when printing)

df %>%
  group_by(id, visit, room) %>%
  summarise(sum = sum(value)) %>%
  spread(room, sum) %>%
  filter(id == 1) %>%
  select_if(function(col) mean(is.na(col)) != 1)

gives

     id visit Bedroom Dining Garage Master Office
  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     1     1      NA     27     27     NA     NA
2     1     2      22     19     NA     20     23
3     1     3      NA     NA     NA     27     NA

Mark thank you for this thorough answer. When I run your code I get some error messages. When I run the first part, I get the error message "Error in n() : This function should not be called directly". When I run the second and third part I get the error message "Error: Key column 'room' does not exist in input. ". This error appears after the summarise clause. By the way, I have posted a question with the title: "Defining a class in R using the Reference Classes paradigm". Would it be possible if you could have a look and see if you could provide an answer? — pv7, Nov 09 '16 at 23:04
It appears that I missed including the part where I set the number of samples to include in the reproducible data that I made for you. After I include that though (now in the question), everything else runs as expected. Are you trying to run individual pieces separately? — Mark Peterson, Nov 10 '16 at 12:43

Grouping and summarising when the columns returned are not known in advance

1 Answers1