R: sum by count over multiple columns

Question

this is related to this question that I've looked at How to summarize by group?, however, it seems that my data is a little different that makes things weird. I have a data.frame DF like so:

X  Y1  Y2  Y3  Y4
3  A   A   B   A
2  B   B   A   A
1  B   A   A   A

I want to make a sort of weighted sum of each unique factor in Y by its numeric value in X, such that the output is:

Y Y1 Y2 Y3 Y4
A 3  4  3  6 
B 3  2  3  0

I had tried using a for loop to iterate over the indices of the columns, but I wasn't able to pass the number of the Y's correctly, and it didn't seem like the R way of doing this efficiently, for many more columns and rows.

It looks like according to the linked question, this is the right approach, however, when I try to extend to do the same across all the columns, via group_by and summarise_each, I get errors as the Y's are factors. Should I be using 'apply' instead? The logic of this seems straight forward but I've been stumped in its implementation.

aggregate(X~Y1,DF,sum)

thelatemail · Answer 1 · 2016-10-20T07:45:31.833

I don't think this is straightforward, and will require melting and reshaping. Here's an attempt in data.table:

setDT(df)    
dcast(melt(df, id.vars="X", value.name="Y")[,.(X=sum(X)), by=.(variable,Y)], Y ~ variable)
#Using 'X' as value column. Use 'value.var' to override
#   Y Y1 Y2 Y3 Y4
#1: A  3  4  3  6
#2: B  3  2  3 NA

Or maybe even just use xtabs if you want to avoid most of the data.table code:

xtabs(X ~ Y + variable, melt(df, id.vars="X", value.name="Y"))

Or a variation using only base R:

xtabs(X ~ ., cbind(df[1], stack(lapply(df[-1],as.character))) )

score 0 · Answer 2 · answered Oct 20 '16 at 05:23

I couldn't get the above to work, using the data.table package so I just wrote me own function to do it.

#@param x = vector of strings that we want to identify
#@param DF = data frame, with the first column as weights and the rest containing strings

#@return a matrix with the same cols and rows as identifiers. contains the summed weights

return_reshape = function(x , DF) {
    store_mat = matrix(0.0,length(x),ncol(DF) - 1)
    dimnames(store_mat) = list(x,colnames(DF)[-1])
    for (row in 1:nrow(DF)) {
        for (index in 1:length(x)) {
            col_index = DF[row,-1] == x[index ]
            store_mat[index ,col_index] = store_mat[index ,col_index] + as.numeric(DF[row,1])
    }
}
store_mat
}

DF = data.frame(X=3:1, Y1 = c("A","B","B"),Y2 = c("A","B","A"),Y3 = c("B","A","A"),Y4 = c("A","A","A"),stringsAsFactors=FALSE)
x = as.character(c("A", "B"))
return_reshape(x,DF)
  Y1 Y2 Y3 Y4
A  3  4  3  6
B  3  2  3  0

I didn't see the `xtabs` solution by @thelatemail that is nice — Cyrillm_44, Oct 20 '16 at 05:29
Try making `df` a `data.table` object first via `setDT(df)` - I've just tested the code again and it works fine. — thelatemail, Oct 20 '16 at 07:47

score 0 · Answer 3 · answered Oct 20 '16 at 07:31

This is actually one matrix %*% another matrix.

X = matrix(c(3,2,1), nrow = 1)
X
     [,1] [,2] [,3]
[1,]    3    2    1

Y_A = matrix(c(1,1,0,1,0,0,1,1,0,1,1,1), nrow = 3, byrow = T)
Y_A

     [,1] [,2] [,3] [,4]
[1,]    1    1    0    1
[2,]    0    0    1    1
[3,]    0    1    1    1

Y_B =  1- Y_A
Y_B

     [,1] [,2] [,3] [,4]
[1,]    0    0    1    0
[2,]    1    1    0    0
[3,]    1    0    0    0

X %*% Y_A
     [,1] [,2] [,3] [,4]
[1,]    3    4    3    6

X %*% Y_B
     [,1] [,2] [,3] [,4]
[1,]    3    2    3    0

R: sum by count over multiple columns

3 Answers3