I have a data frame, say x, that is passed to a function which returns a subset of its rows depending on the value of the column x$id.
This subset, y, includes a column y$room whose set of values differs depending on the x$id value.
The subset is then spread with tidyr, so that the values of y$room become columns.
The resulting wide data frame, say y_ext, must then be grouped by the column y_ext$visit, and summary statistics must be calculated for the remaining columns by a special function.
The obvious problem is that these columns are not known in advance and therefore cannot be referred to by name within the function.
The alternative of using column indexes instead of names does not seem to work with dplyr when group_by is involved.
Do you have any ideas how this problem could be tackled?
The data frame has many thousands of rows, so I can only give you a glimpse:
> tail(y)
      id visit        room value
11940 14     2 living room    19
11941 14     2 living room    16
11942 14     2 living room    15
11943 14     2 living room    22
11944 14     2 living room    25
11945 14     2 living room    20
> unique(x$id)
[1] 14 20 41 44 46 54 64 74 104 106
> unique(x$visit)
[1] 0 1 2
> unique(x$room)
[1] "bedroom" "living room" "family room" "study room" "den"
[6] "tv room" "office" "hall" "kitchen" "dining room"
> summary(x$value)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
   2.000    2.750    7.875   17.410   16.000 1775.000
For a given id, spread() from tidyr produces columns for only a subset of the room values in x. E.g. for id = 14:
> y<- out
> y$row <- 1 : nrow(y)
> y_ext <- spread(y, room, value)
> head(y_ext)
  id visit row bedroom family room living room
1 14     0   1    6.00          NA          NA
2 14     0   2    6.00          NA          NA
3 14     0   3    2.75          NA          NA
4 14     0   4    2.75          NA          NA
5 14     0   5    2.75          NA          NA
6 14     0   6    2.75          NA          NA
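Since I cannot post the full data, here is a small made-up sketch of the same step. The data frame `toy`, its values, and the helper `spread_for_id()` are hypothetical stand-ins for my real data and function; the only point is that the room columns differ from one id to the next.

```r
library(tidyr)

# Hypothetical toy data standing in for x (all values are made up)
toy <- data.frame(
  id    = c(14, 14, 14, 14, 20, 20),
  visit = c(0, 0, 1, 2, 0, 1),
  room  = c("bedroom", "living room", "living room", "family room",
            "den", "kitchen"),
  value = c(6, 2.75, 19, 16, 8, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: subset one id, add a unique row key,
# then spread the room values into columns
spread_for_id <- function(df, target_id) {
  y <- df[df$id == target_id, ]
  y$row <- seq_len(nrow(y))  # unique key so spread() sees no duplicate identifiers
  spread(y, room, value)
}

names(spread_for_id(toy, 14))  # room columns for id 14
names(spread_for_id(toy, 20))  # a different set of room columns for id 20
```

With this toy data, id 14 yields the room columns bedroom, family room and living room, while id 20 yields den and kitchen, so nothing downstream can rely on fixed column names.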
Now I must write a function that groups the result by visit and summarises whichever columns are returned, giving output of the following form:
  visit bedroom family room living room
1     0      NA        2.79        3.25
2     1      NA          NA        4.53
3     2    4.19        3.77          NA
As I mentioned above, I do not know in advance which columns will be returned for a given id, and this complicates the problem. Of course, a shortcut would be to find out for each id which columns are returned and then write an if structure that directs each id to the appropriate code, but I am afraid this is not very elegant.
I hope this gives you a better picture.
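To make the goal concrete, here is a rough sketch of one direction that might work, assuming dplyr 1.0 or later, where `across()` can select columns by exclusion instead of by name. The toy data `toy_ext`, the function name `summarise_rooms()`, and the placeholder statistic `mean_or_na()` are assumptions made up for illustration; my real summary function is different.

```r
library(dplyr)

# Hypothetical wide data shaped like y_ext (all values are made up)
toy_ext <- data.frame(
  id    = 14,
  visit = c(0, 0, 1, 2, 2),
  row   = 1:5,
  bedroom       = c(NA, NA, NA, 4, 5),
  `family room` = c(2, 4, NA, NA, 3),
  `living room` = c(3, NA, 5, NA, NA),
  check.names = FALSE  # keep the spaces in the column names
)

# Placeholder statistic: mean of the non-missing values, NA if all are missing
mean_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
}

# Summarise every column except the bookkeeping ones (id, row),
# whatever the remaining columns happen to be called.
# across() leaves the grouping column (visit) out automatically.
summarise_rooms <- function(df, stat = mean_or_na) {
  df %>%
    group_by(visit) %>%
    summarise(across(-c(id, row), stat), .groups = "drop")
}

res <- summarise_rooms(toy_ext)
res
```

The function names only the columns that are always present (id, visit, row), so it should not matter which room columns a given id produces.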