0

Suppose I have following data.frame:

test <- data.frame(id = c(rep(1,6), rep(2,3)),
monthYear = c("Dec16","Nov16","Oct16","Dec16","Nov16","Oct16","Dec16","Nov16","Dec16"),
color = c(rep("black", 4),rep("red",2),rep("black", 2),"red"),
quantity = c(1:9))

which results in this data.frame:

> test
  id monthYear color quantity
1  1     Dec16 black        1
2  1     Nov16 black        2
3  1     Oct16 black        3
4  1     Dec16 black        4
5  1     Nov16   red        5
6  1     Oct16   red        6
7  2     Dec16 black        7
8  2     Nov16 black        8
9  2     Dec16   red        9

For each unique id I need to create a subset of color by monthYear, where each monthYear is placed as a new column and quantity is aggregated accordingly:

library("reshape")
for (i in unique(test$id)) {
        subsetTest <- aggregate(cbind(quantity) ~ monthYear + color,
                                data = test[test$id==i,],
                                sum)
        subsetTestColumns <- reshape(subsetTest,
                                     timevar=as.character("monthYear"),
                                     idvar="color",
                                     direction="wide")
        names(subsetTestColumns)[-1]<-as.character(unique(subsetTest$monthYear))
        subsetTestColumns[is.na(subsetTestColumns)]<-0
        # Reorder by column name (just in case)
        subsetTestColumns[c("color", "Dec16", "Nov16","Oct16")]

        ### Do something with the resulting data.frame
}

For the first Id, the output is as expected:

> subsetTestColumns
  color Dec16 Nov16 Oct16
1 black     5     2     3
4   red     0     5     6

However, the loop fails for id=2 because there are not complete cases (Oct16 does not exist at all). Note that the structure of the resulting data.frame has to be exactly [color, Dec16, Nov16, Oct16]

I probably could check (before naming the monthYear columns) if all of the three monthYears exist, and if not, create them manually. But somehow, I do not feel comfortable with this way.

I assume, that the best way to prevent the error would be adding the missing observations with zero quantities. Therefore, after creating the subsetTest data.frame with the aggregate function, I get following structure for id=2:

  monthYear color quantity
1     Dec16 black        7
2     Nov16 black        8
3     Dec16   red        9

In my opinion, adding 0 quantity to black for Oct16, and 0 quantity to red for Nov16 and Oct16 seems the more logic way to resolve the problem. Can you light me up on how to achieve this, taking into consideration that the original data has more than 100 ids and in each case the missing observations would be totally different?

agustin
  • 1,311
  • 20
  • 42

0 Answers0