
My data look like the sample below: 60,000 instances of 93 variables. I want to count the zeros in the first 4 variables, then the zeros in the next 4 variables, and so on all the way to the 93rd variable. Currently I have:

idx1<-c(1:4)

Z1<-rowSums(Pds[idx1]==0) 

To make the above work for every group, I would need to copy and paste it 20 times and alter the code for each variable group. Is there an easier way? I will also be doing this for different group sizes, i.e., every 3 variables, every 10 variables, every 2 variables. I am saving all of these to new variables. If anyone is wondering, I'm doing the Kaggle Otto Group challenge for my data mining class final project. As usual, thanks to everyone who helps.

 df=    feat_1  feat_2  feat_3  feat_4....
          1       0        0      0
          0       0        0      0
          0       0        0      0
          1       0        0      1
          0       0        0      0
          2       1        0      0
          2       0        0      0
          .        .         .       .
          .        .         .       .
          .        .         .       .
          .        .         .       .
mbs1

2 Answers


Let's start with some sample data.

# Sample data
set.seed(144)
dat <- matrix(sample(0:1, 100, replace=TRUE), 10, 10)

Once you split the column identifiers as you want them, you won't have far to go. Luckily, this has been addressed on SO before.

# Split into groups of 4
split(seq(ncol(dat)), ceiling(seq(ncol(dat))/4))
# $`1`
# [1] 1 2 3 4
# 
# $`2`
# [1] 5 6 7 8
# 
# $`3`
# [1]  9 10

Now all you need to do is call rowSums with the columns in each grouping to get the desired count, combining the results into a matrix. sapply is convenient for this:

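grouped.sum <- function(dat, size) {
  # Column indices split into consecutive groups of `size`
  groups <- split(seq(ncol(dat)), ceiling(seq(ncol(dat))/size))
  # Per-row zero count within each group; drop=FALSE keeps one-column groups as matrices
  sapply(groups, function(x) rowSums(dat[,x,drop=FALSE] == 0))
}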
grouped.sum(dat, 3)
#       1 2 3 4
#  [1,] 2 1 1 0
#  [2,] 2 2 2 1
#  [3,] 0 2 3 0
#  [4,] 1 1 2 0
#  [5,] 3 2 1 0
#  [6,] 1 2 0 0
#  [7,] 2 1 2 1
#  [8,] 1 2 2 0
#  [9,] 1 2 1 1
# [10,] 2 1 1 1
grouped.sum(dat, 4)
#       1 2 3
#  [1,] 2 1 1
#  [2,] 3 2 2
#  [3,] 1 3 1
#  [4,] 1 2 1
#  [5,] 4 2 0
#  [6,] 2 1 0
#  [7,] 3 2 1
#  [8,] 1 3 1
#  [9,] 2 1 2
# [10,] 2 2 1
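
Since the question mentions several group sizes and saving the counts as new variables, here is a rough sketch of how that might look. The data frame `Pds` below is a small made-up stand-in for the real data, and the helper name `grouped_zero_counts` and the `zeros_<size>_<group>` column names are my own choices for illustration, not anything from the original post.

```r
# Hypothetical stand-in for the Otto data: 6 rows, 10 count-valued features
set.seed(1)
Pds <- as.data.frame(matrix(sample(0:2, 60, replace = TRUE), nrow = 6))
names(Pds) <- paste0("feat_", seq_len(ncol(Pds)))

# Count zeros per row within consecutive blocks of `size` columns
grouped_zero_counts <- function(dat, size) {
  groups <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / size))
  out <- sapply(groups, function(cols) rowSums(dat[, cols, drop = FALSE] == 0))
  # Label each column with the block size and the block number
  colnames(out) <- paste0("zeros_", size, "_", seq_along(groups))
  out
}

# Run every block size of interest and bind the results next to the original data
sizes <- c(2, 3, 4, 10)
zero_feats <- do.call(cbind, lapply(sizes, function(s) grouped_zero_counts(Pds, s)))
Pds_aug <- cbind(Pds, zero_feats)
```

The last block is simply shorter when the number of columns is not a multiple of the size, so no padding is needed.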
josliber

rowsum is good for this: transpose your matrix, then sum its rows by a grouping variable (this is equivalent to grouping the original columns), and transpose the result back.

n <- 4

idx <- rep(1:ceiling(ncol(dat)/n), each=n, length=ncol(dat))

t(rowsum(t(!dat)*1, idx))
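
As a sanity check, here is a small sketch comparing this approach with plain rowSums on hand-written column blocks. The sample matrix and the names `zeros_by_group` and `ref` are made up for illustration. I have written `dat == 0` instead of `!dat` because it states the zero test explicitly; for numeric data the two should mark the same cells.

```r
# Made-up 10 x 10 matrix of 0/1 values
set.seed(42)
dat <- matrix(sample(0:1, 100, replace = TRUE), 10, 10)

n <- 4
# Group label for each column: 1 1 1 1 2 2 2 2 3 3
idx <- rep(seq_len(ceiling(ncol(dat) / n)), each = n, length.out = ncol(dat))

# t() turns columns into rows so rowsum() can add them up within each group
zeros_by_group <- t(rowsum(t(dat == 0) * 1, idx))

# Reference: rowSums() on each block of columns written out by hand
ref <- cbind(rowSums(dat[, 1:4] == 0),
             rowSums(dat[, 5:8] == 0),
             rowSums(dat[, 9:10] == 0))

# Both approaches should give identical counts
stopifnot(all(zeros_by_group == ref))
```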
user20650