
I have a problem very similar to that described here:

subset of data.frame columns to maximize "complete" observations

I am trying to schedule a workshop that will meet five times. I have ten days from which to choose meeting dates, each day having three overlapping possible meeting times. Hence, I have 30 columns grouped into ten groups (days) of three columns (meeting times) each. I need to select 5 columns (or meeting date–time combinations) subject to the following criteria: only one meeting time is selected per day (one column per group); the number of respondents (rows) who can attend all 5 meetings is maximized. Ideally, I would also want to know how the optimal column selection changes if I relax the criterion that respondents must attend ALL 5 meetings, requiring only that they attend 4, or 3, etc.

For simple visualization, let's say I want to know which two columns I should choose—no more than one each from V1, V2, and V3—such that I maximize the number of rows that have no zeros (i.e. row sums to 2).

V1A   V1B   V1C   V2A   V2B   V2C   V3A   V3B   V3C  
1     0     1     0     1     1     1     0     1   
1     1     0     0     1     1     0     1     1   
0     0     1     1     1     0     0     1     1   
1     1     1     1     0     0     1     0     0 
1     0     0     0     1     1     0     1     0 
0     1     1     0     1     1     0     0     0 
1     0     1     1     1     0     1     0     1

The actual data are here: https://drive.google.com/file/d/0B03dE9-8088aMklOUVhuV3gtRHc/view Groups are mon1* tue1* [...] mon2* tue2* [...] fri2*.

The code proposed in the link above would solve my problem if I did not need to select columns by group. Ideally, I would also be able to say which columns I should choose to maximize the number of rows under the weaker condition that a row may contain one zero (i.e. the row sums to at least 4), or two zeros (at least 3), etc.
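For a toy problem of this size, the group constraint can be handled by brute force: enumerate every admissible selection (here, one column from each of two distinct groups) and count the rows that are 1 in all selected columns. Below is a sketch in R; the data frame is my reconstruction of the table above, and the names (df, groups, results) are my own. The same pattern extends to choosing 5 of the 10 day-groups in the real data (choose(10, 5) * 3^5 = 61,236 combinations, still trivial to enumerate).

```r
# Toy data from the question: three groups (V1, V2, V3) with three options each
df <- data.frame(
  V1A = c(1,1,0,1,1,0,1), V1B = c(0,1,0,1,0,1,0), V1C = c(1,0,1,1,0,1,1),
  V2A = c(0,0,1,1,0,0,1), V2B = c(1,1,1,0,1,1,1), V2C = c(1,1,0,0,1,1,0),
  V3A = c(1,0,0,1,0,0,1), V3B = c(0,1,1,0,1,0,0), V3C = c(1,1,1,0,0,1,1)
)

# Group columns by their prefix (drop the trailing letter)
groups <- split(names(df), sub(".$", "", names(df)))

# Enumerate every pair of distinct groups, then every one-column-per-group
# choice, counting rows that are 1 in all selected columns ("complete" rows)
results <- do.call(rbind, lapply(combn(names(groups), 2, simplify = FALSE),
  function(g) {
    combos <- expand.grid(groups[g], stringsAsFactors = FALSE)
    complete <- apply(combos, 1, function(cols)
      sum(rowSums(df[, cols]) == length(cols)))
    data.frame(col1 = combos[[1]], col2 = combos[[2]], complete = complete)
  }))

results[which.max(results$complete), ]
# best pair: V2B + V3C, with 5 of the 7 respondents able to attend both
```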

Many thanks!


3 Answers


You could use rowSums to get, for each group, the indices of the rows that have at least two 1's. (The conditions in the question are not entirely clear.)

  lapply(split(names(df),sub('.$', '', names(df))), 
          function(x) which(rowSums(df[x])>=2))
  #$V1
  #[1] 1 2 4 6 7

  #$V2
  #[1] 1 2 3 5 6 7

  #$V3
  #[1] 1 2 3 7
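As a follow-up (my own extension, not part of the answer above): since the split gives the qualifying rows per group, intersecting those index sets yields the rows that qualify in every group. A sketch, with df reconstructed from the toy table in the question:

```r
# Toy data from the question
df <- data.frame(
  V1A = c(1,1,0,1,1,0,1), V1B = c(0,1,0,1,0,1,0), V1C = c(1,0,1,1,0,1,1),
  V2A = c(0,0,1,1,0,0,1), V2B = c(1,1,1,0,1,1,1), V2C = c(1,1,0,0,1,1,0),
  V3A = c(1,0,0,1,0,0,1), V3B = c(0,1,1,0,1,0,0), V3C = c(1,1,1,0,0,1,1)
)

# Per-group indices of rows with at least two 1's, as in the answer above
idx <- lapply(split(names(df), sub(".$", "", names(df))),
              function(x) which(rowSums(df[x]) >= 2))

# Rows that satisfy the >= 2 condition in every group
Reduce(intersect, idx)
# [1] 1 2 7
```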

This finds, within each of the three groups, the index of the first column equal to 1 in each row (or simply the first column if all are zero), and returns a matrix with one column per group.

f <- substring(colnames(df), 1L, nchar(colnames(df))-1L)
ans <- lapply(split(as.list(df), f),
              function(x) max.col(do.call(cbind, x), ties.method="first"))
do.call(cbind, ans)
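For reference, here is what that produces on the toy table from the question (df is my own reconstruction of that table):

```r
# Toy data from the question
df <- data.frame(
  V1A = c(1,1,0,1,1,0,1), V1B = c(0,1,0,1,0,1,0), V1C = c(1,0,1,1,0,1,1),
  V2A = c(0,0,1,1,0,0,1), V2B = c(1,1,1,0,1,1,1), V2C = c(1,1,0,0,1,1,0),
  V3A = c(1,0,0,1,0,0,1), V3B = c(0,1,1,0,1,0,0), V3C = c(1,1,1,0,0,1,1)
)

# Group factor from the column-name prefixes
f <- substring(colnames(df), 1L, nchar(colnames(df)) - 1L)

# First column equal to 1 in each row of each group (column 1 if all zero)
ans <- lapply(split(as.list(df), f),
              function(x) max.col(do.call(cbind, x), ties.method = "first"))
do.call(cbind, ans)
#      V1 V2 V3
# [1,]  1  2  1
# [2,]  1  2  2
# [3,]  3  1  2
# [4,]  1  1  1
# [5,]  1  2  2
# [6,]  2  2  1
# [7,]  1  1  1
```

Note that row 6 of V3 is all zeros, so the reported index 1 means "no available time", not an actual 1.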

With your dataset, this returns, for each group, the rows in which all three values are 1 (i.e. the group sum equals 3):

> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1, 
                                           function(z) sum(z, na.rm=TRUE)==3) ) )
[[1]]
[1] 4

[[2]]
integer(0)

[[3]]
integer(0)

If you relax the requirement so that a group sum of at least 2 qualifies, you get more candidates:

> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1, function(z) sum(z, na.rm=TRUE)>=2) ) )
[[1]]
[1] 1 2 4 6 7

[[2]]
[1] 1 2 3 5 6 7

[[3]]
[1] 1 2 3 7
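To explore the relaxed criterion in the question (respondents attending at least k of the selected meetings), one can tabulate row sums over a fixed selection. The selection below is hypothetical (one column per group), and df is my reconstruction of the toy table:

```r
# Toy data from the question
df <- data.frame(
  V1A = c(1,1,0,1,1,0,1), V1B = c(0,1,0,1,0,1,0), V1C = c(1,0,1,1,0,1,1),
  V2A = c(0,0,1,1,0,0,1), V2B = c(1,1,1,0,1,1,1), V2C = c(1,1,0,0,1,1,0),
  V3A = c(1,0,0,1,0,0,1), V3B = c(0,1,1,0,1,0,0), V3C = c(1,1,1,0,0,1,1)
)

cols <- c("V1A", "V2B", "V3C")  # hypothetical selection: one column per group
attend <- rowSums(df[, cols])   # meetings each respondent can attend

# How many respondents can attend at least k of the selected meetings?
sapply(0:length(cols), function(k) sum(attend >= k))
# [1] 7 7 6 3
```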

Now, what exactly are the rules for this task?
