Suppose I have following data.frame
:
test <- data.frame(id = c(rep(1,6), rep(2,3)),
monthYear = c("Dec16","Nov16","Oct16","Dec16","Nov16","Oct16","Dec16","Nov16","Dec16"),
color = c(rep("black", 4),rep("red",2),rep("black", 2),"red"),
quantity = c(1:9))
which results in this data.frame:
> test
id monthYear color quantity
1 1 Dec16 black 1
2 1 Nov16 black 2
3 1 Oct16 black 3
4 1 Dec16 black 4
5 1 Nov16 red 5
6 1 Oct16 red 6
7 2 Dec16 black 7
8 2 Nov16 black 8
9 2 Dec16 red 9
For each unique id
I need to create a subset of color
by monthYear
, where each monthYear is placed as a new column and quantity is aggregated accordingly:
library("reshape")
for (i in unique(test$id)) {
subsetTest <- aggregate(cbind(quantity) ~ monthYear + color,
data = test[test$id==i,],
sum)
subsetTestColumns <- reshape(subsetTest,
timevar=as.character("monthYear"),
idvar="color",
direction="wide")
names(subsetTestColumns)[-1]<-as.character(unique(subsetTest$monthYear))
subsetTestColumns[is.na(subsetTestColumns)]<-0
# Reorder by column name (just in case)
subsetTestColumns[c("color", "Dec16", "Nov16","Oct16")]
### Do something with the resulting data.frame
}
For the first Id, the output is as expected:
> subsetTestColumns
color Dec16 Nov16 Oct16
1 black 5 2 3
4 red 0 5 6
However, the loop
fails for id=2 because there are not complete cases (Oct16 does not exist at all). Note that the structure of the resulting data.frame has to be exactly [color, Dec16, Nov16, Oct16]
I probably could check (before naming the monthYear columns) if all of the three monthYears exist, and if not, create them manually. But somehow, I do not feel comfortable with this way.
I assume, that the best way to prevent the error would be adding the missing observations with zero quantities. Therefore, after creating the subsetTest
data.frame with the aggregate
function, I get following structure for id=2:
monthYear color quantity
1 Dec16 black 7
2 Nov16 black 8
3 Dec16 red 9
In my opinion, adding 0 quantity to black for Oct16, and 0 quantity to red for Nov16 and Oct16 seems the more logic way to resolve the problem. Can you light me up on how to achieve this, taking into consideration that the original data has more than 100 ids and in each case the missing observations would be totally different?