Subsetting Data Based on Number of Rows True

Question

In R, I have a dataset with 10 columns of interest. These columns are binary, where 1 means true and 2 means false. I need to subset the data to find the number of times at which only 2 columns equal 1, only 3 columns equal 1, etc. The issue- I don't care which 3 equal 1, I just need data on how many times it occurs that any combination of 3 equals 1.

This is the only way I have been able to find this, but I have to write all this code for each combination, and this is way too tedious with 1023 possible combinations.

a1x3 <- filter(ALL_STATES_TBL, ace3 == "1" & ace4 == "2" & ace5 == "2" & ace6 == "2" & ace7 == "2" & ace8 == "2" & ace9 == "2") a1x3t <- filter(a1x3, ace1 == "3") a1x3tt <- filter(a1x3, ace1 == "4") a1x3 <- rbind(a1x3t, a1x3tt)

Any help would be greatly appreciated! [Also, code is written this way because ace1 has 4 options instead of 2].

jay.sf · Answer 1 · 2023-06-27T05:38:12.240

0

We could iteratively subset by n columns, equal to == 1 and count the rowSums how many times they are n. combn does the all combinations job.

f <- \(x) {
  sq <- seq_along(x)[-1]
  sapply(setNames(sq, sq), \(n) 
         sum(unlist(combn(ncol(x), n, \(j) rowSums(as.data.frame(x)[, j] == 1) == n))))
}

Consider this data frame.

dat1
#   V1 V2 V3 V4
# 1  1  2  1  1
# 2  1  2  2  2
# 3  1  2  1  1
# 4  1  2  2  1
    
# 2 3 4 
# 7 2 0

Or this one

dat2
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1  1  2  1  1  1  2  2  2  1   2
# 2  1  2  2  2  1  1  2  1  1   2
# 3  1  2  1  2  1  1  2  2  2   2
# 4  1  1  2  2  1  1  2  2  2   2
# 5  2  2  1  2  1  1  1  2  2   1

f(dat2)
#  2  3  4  5  6  7  8  9 10 
# 42 38 17  3  0  0  0  0  0

Data:

set.seed(42)
dat1 <- as.data.frame(matrix(sample(1:2, 16, replace=TRUE), 4, 4))

set.seed(42)
dat2 <- as.data.frame(matrix(sample(1:2, 50, replace=TRUE), 5, 10))

edited Jun 27 '23 at 05:38

answered Jun 26 '23 at 20:44

jay.sf

60,139
8
53
110

Thanks so much! My problem is "j" seems to need a specific column name, and I'm evaluating 10 columns? It worked when I ran exactly the code you suggested, but I attempted to apply this to my own data and got an error concerning "j". – Ivey Jun 26 '23 at 21:50
@Ivey Make your error reproducible for me, See https://stackoverflow.com/q/5963269/6574038 – jay.sf Jun 26 '23 at 22:08
Example set of data: structure(list(year = c(2016, 2016, 2016, 2017), fipsst = c(1, 2, 2, 3), ace1 = c(1, 2, 3, 4), ace3 = c(1, 2, 1, 2), ace4 = c(1, 1, 1, 2), ace5 = c(1, 2, 2, 2), ace6 = c(1, 1, 1, 1), ace7 = c(1, 2, 1, 2), ace8 = c(1, 2, 2, 2), ace9 = c(1, 1, 1, 2), ace10 = c(2, 2, 2, 2), ace11 = c(2, 2, 1, 2), class = "data.frame"); when I run f() I receive "Error in `[.data.table`(x, , j) : j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[, ..j]. Difference to data.frame is deliberate. Called from: `[.data.table`(x, , j)" – Ivey Jun 26 '23 at 22:27
I also tried the original f with sum instead of rowSums to combat this error (I removed the year and fips columns as well and renamed all columns ace to get around j error-) code now looks like "f <- \(x) { sq <- seq_along(x)[-1] sapply(setNames(sq, sq), \(n) sum(unlist(combn(ncol(x), n, \(ace) sum(x[, ace, drop = FALSE] == 1) == n)))) }". However, applying this just returns NAs. – Ivey Jun 27 '23 at 04:04
@Ivey You have class `'data.table'` (try `class()`), not sure why it doesn't work with it, but they are slightly different from `'data.frame'`. I've added a line in the function, should work now. – jay.sf Jun 27 '23 at 05:17
@ivey Be sure to only select specific columns to get correct results, e.g. `f(dat[, 3:12])` in your example. – jay.sf Jun 27 '23 at 05:29

Subsetting Data Based on Number of Rows True

1 Answers1