1

I have a data frame with lots of columns. For example:

sample treatment col5 col6 col7
  1        a       3    0   5  
  2        a       1    0   3
  3        a       0    0   2
  4        b       0    1   1

I want to select the sample and treatment columns plus all columns that meet the following 2 conditions:

  1. Their value on the row in which treatment == 'b' is 0
  2. Their value from at least one row where treatment == 'a' is not 0.

The expected result should look like this:

sample treatment col5
  1        a       3      
  2        a       1      
  3        a       0      
  4        b       0       

Example dataframe:

structure(list(sample = 1:4, treatment = structure(c(1L, 1L, 
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3, 
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA, 
-4L))
divibisan
  • 11,659
  • 11
  • 40
  • 58
Newfarmer
  • 13
  • 5
  • 4
    Possible duplicate of [Filter data.frame rows by a logical condition](https://stackoverflow.com/questions/1686569/filter-data-frame-rows-by-a-logical-condition); To OP adding an additional condition to the duplicate question simply add an `&` between conditions, writing each condition as `data[cond1 & cond2, ]` – Oliver Aug 14 '19 at 21:51
  • Sorry, I am new for this website, and what do you mean with the comment? I checked the question you mentioned, but that is to select rows not columns. – Newfarmer Aug 14 '19 at 21:56
  • Sorry i made a quickie. As seen in Shree's answer, selecting columns can be done by switching from `data[cond1 & cond2, ]` to `data[, cond1 & cond2]` (notice the position of the comma). Basically you index `data.frame`s, `matrix`'s as `data[row, column]`, where `row` is a condition or row numbers you'd like to extract and similar for the column argument. Check out [Hadley's Advanced R](https://adv-r.hadley.nz/) online book. Despite it's name the first looking at necessary chapters such as `subsetting` can get you far, without having to read the advanced stuff. – Oliver Aug 14 '19 at 22:00

2 Answers2

3

Here's a way in base R -

cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0

df[, c(TRUE, TRUE, cs_a & cs_b)]

  sample treatment col5
1      1         a    3
2      2         a    1
3      3         a    0
4      4         b    0

With dplyr -

df %>% 
  select_at(which(c(TRUE, TRUE, cs_a & cs_b)))
Shree
  • 10,835
  • 1
  • 14
  • 36
0

Here is much more verbose way in tidyverse that does not require manual colSums for each level of treatment:

library(dplyr)
library(purrr)
library(tidyr)

sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)

dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
  .tbl = dd,
  .predicate = is.numeric,
  .funs = function(x)
    x == 0
)

# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>% 
  group_by(treatment) %>% 
  summarise_at(.vars = vars(col5:col7), .funs = "sum")

# then find the names of the columns you want to keep
keeper_columns <-
  number_of_zeros %>% 
  select(-treatment) %>% # remove the treatment grouping variable
  map_dfr( # function to check if all entries per column (now per treatment level) are greater zero
    .x = .,
    .f = function(x)
      all(x > 0)
  ) %>% 
  gather(column, keeper) %>% # reformat 
  filter(keeper == TRUE) %>% # to grab the keepers
  select(column) %>% # then select the column with column names
  unlist %>% # and convert to character vector
  unname

# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, keeper_columns)
teofil
  • 2,344
  • 1
  • 8
  • 17