
I've searched SO for a solution to no avail, so here it is. I have a data frame with many columns, some of which are numeric and should be non-negative. I want to clean the data, since some values in these numeric columns are negative. I can already extract the names of these columns with a regular expression, but I am not sure how to filter the rows based on them.

To give an example, let's say:

library(dplyr)
df <- read.table(text = 
  "id   sth1    tg1_num   sth2    tg2_num    others   
  1     dave    2         ca      35         new
  2     tom     5         tn      -3         old
  3     jane    -3        al       0         new
  4     leroy   0         az      25         old
  5     jerry   4         mi      55         old", header=TRUE)
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
df <- df %>% filter(target_columns >= 0) # this is wrong, but it's what I want to do

What I want to get out from this filtering is the following:

id   sth1 tg1_num   sth2 tg2_num others
1    dave       2     ca      35    new
4   leroy       0     az      25    old
5   jerry       4     mi      55    old

where rows 2 and 3 are filtered out because at least one of the tg1_num and tg2_num values in those rows is negative.

breezymri
  • `df %>% select(matches("_num$"))`? – Vlo Aug 19 '15 at 19:30
  • What is the desired output? Do you want the whole data set back, or just the columns matching the pattern? Do you want both columns to be greater than or equal to zero, or is one enough? Please show us the final product. – David Arenburg Aug 19 '15 at 19:43
  • @Vlo that is a way to select the target columns, but it does not solve my problem – breezymri Aug 19 '15 at 19:48

7 Answers


Here's a possible vectorized solution

ind <- grep("_num$", colnames(df))
df[!rowSums(df[ind] < 0),]
#   id  sth1 tg1_num sth2 tg2_num others
# 1  1  dave       2   ca      35    new
# 4  4 leroy       0   az      25    old
# 5  5 jerry       4   mi      55    old

The idea here is to create a logical matrix using the `<` function (it is a generic function with a data.frame method, so it returns a data-frame-like structure). Then we use rowSums to count, per row, how many values met the condition (> 0 means at least one negative value was found; 0 means none). Then we use the ! function to convert that count into a logical vector: > 0 becomes FALSE, while 0 becomes TRUE. Finally, we subset the rows according to that vector.
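
To make the steps concrete, here is roughly what each intermediate evaluates to on the example data:

df[ind] < 0               # logical matrix: TRUE where a value is negative
#   tg1_num tg2_num
# 1   FALSE   FALSE
# 2   FALSE    TRUE
# 3    TRUE   FALSE
# 4   FALSE   FALSE
# 5   FALSE   FALSE
rowSums(df[ind] < 0)      # negatives per row: 0 1 1 0 0
!rowSums(df[ind] < 0)     # TRUE where a row has no negatives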

David Arenburg
  • Thanks. It is a good and intuitive solution. I accepted the answer by @user295691 because I think it is comprehensive. The use of rowMins is also probably faster than rowSums. – breezymri Aug 19 '15 at 23:33

This is a very awkward use of dplyr, but it might be true to the spirit:

> df %>% mutate(m = do.call(pmin, select(df, ends_with("_num"))))
  id  sth1 tg1_num sth2 tg2_num others  m
1  1  dave       2   ca      35    new  2
2  2   tom       5   tn      -3    old -3
3  3  jane      -3   al       0    new -3
4  4 leroy       0   az      25    old  0
5  5 jerry       4   mi      55    old  4

From there you can add a filter(m >= 0) to get the answer you want. If there were a rowMins analogous to rowMeans then that would simplify this significantly.

> rowMins <- function(df) { do.call(pmin, df) }
> df %>% mutate(m = rowMins(select(df, ends_with("_num"))))
  id  sth1 tg1_num sth2 tg2_num others  m
1  1  dave       2   ca      35    new  2
2  2   tom       5   tn      -3    old -3
3  3  jane      -3   al       0    new -3
4  4 leroy       0   az      25    old  0
5  5 jerry       4   mi      55    old  4

I don't know how efficient this is, though. And nesting the select seems real ugly.

EDIT3: Using ideas cribbed from other solutions/comments (h/t to @Vlo) I can speed mine up a lot. (Unfortunately, a similar optimization speeds up @Vlo's solution even more. EDIT4: Whoops, misread the chart; mine is the fastest. OK, no more on this.)

df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}

EDIT: out of curiosity, did some microbenchmarking on some of the solutions (EDIT2: Added more solutions)

library(microbenchmark)

microbenchmark(rowmins(df), rowmins2(df), reducer(df), sapplyer(df), grepapply(df), tchotchke(df), withrowsums(df), reducer2(df))

Unit: microseconds
            expr       min         lq      mean    median        uq       max
     rowmins(df)  1373.452  1431.9700  1732.188  1576.043  1729.410  5147.847
    rowmins2(df)   836.885   875.9900  1015.364   913.285  1038.729  2510.339
     reducer(df)   990.096  1058.6645  1217.264  1201.159  1297.997  3103.809
    sapplyer(df) 14119.236 14939.8755 16820.701 15952.057 16612.709 66023.721
   grepapply(df) 12907.657 13686.2325 14517.140 14485.520 15146.294 17291.779
   tchotchke(df)  2770.818  2939.6425  3114.233  3036.926  3172.325  4098.161
 withrowsums(df)  1526.227  1627.8185  1819.220  1722.430  1876.360  3025.095
    reducer2(df)   900.524   943.1265  1087.025  1003.820  1109.188  3869.993

And here are the definitions I used

rowmins <- function(df) {
  df %>%
    mutate(m = rowMins(select(df, ends_with("_num")))) %>%
    filter(m >= 0) %>%
    select(-m)
}

rowmins2 <- function(df) {
  df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
}

reducer <- function(df) {
  df %>%
    select(matches("_num$")) %>%
    lapply(">=", 0) %>%
    Reduce(f = "&", .) %>%
    which %>%
    slice(.data = df)
}

reducer2 <- function(df) {
  df %>%
    select(matches("_num$")) %>%
    lapply(">=", 0) %>%
    Reduce(f = "&", .) %>%
    {df[.,]}
}

sapplyer <- function(df) {
  nums <- sapply(df, is.numeric)
  df[apply(df[, nums], MARGIN=1, function(x) all(x >= 0)), ]
}

grepapply <- function(df) {
  cond <- df[, grepl("_num$", colnames(df))] >= 0
    df[apply(cond, 1, function(x) {prod(x) == 1}), ]
}

tchotchke <- function(df) {
  pattern <- "_num$"
  ind <- grep(pattern, colnames(df))
  target_columns <- colnames(df)[ind]
  # note: this only computes the indices of the offending rows; it does not subset df
  desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
  as.vector(unique(unlist(desired_rows)))
}

withrowsums <- function(df) {
  # >= 0 (not > 0) so that zeros are kept; m == 2 hardcodes the two target columns
  df %>% mutate(m=rowSums(select(df, ends_with("_num"))>=0)) %>% filter(m==2) %>% select(-m)
}


df <- data.frame(id=1:10000, sth1=sample(LETTERS, 10000, replace=T), tg1_num=runif(10000,-1,1), tg2_num=runif(10000,-1, 1))
user295691
  • Check this one out. `df` has to contain negative and positive integer (no small decimals) values like OP's sample dataset. `superreducer <- function(df) { df %>% select(matches("_num$")) %>% Reduce(bitwOr, .) %>% {.>=0L} %>% which %>% slice(.data = df) }` is 20% faster than `reducer()` – Vlo Aug 19 '15 at 21:47
  • What is the meaning of the curly brackets around `{df[. >= 0,]}`? – dpprdan Aug 22 '16 at 08:21
  • @dpprdan: It's a magrittr quirk; if you run a pipe into an anonymous block (i.e. one enclosed in braces), the result of the previous step is available as `.`, so `3 %>% { . + 1 }` yields `4`. Here it filters for all rows where the row minimum is greater than or equal to zero. – user295691 Aug 30 '16 at 14:32

I wanted to see if this was possible using standard evaluation with dplyr's filter_. It turns out it can be done with the help of interp from lazyeval, following the example code on this page. Essentially, you create a list of interp conditions, which you then pass to the .dots argument of filter_.

library(lazyeval)

dots <- lapply(target_columns, function(cols){
    interp(~y >= 0, .values = list(y = as.name(cols)))
})

filter_(df, .dots = dots)   

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
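
For reference, dots holds one one-sided formula per target column; printing it should show something like:

dots
# [[1]]
# ~tg1_num >= 0
#
# [[2]]
# ~tg2_num >= 0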

Update

Starting with dplyr 0.7, this can be done directly with filter_at and all_vars (no lazyeval needed).

df %>%
     filter_at(vars(target_columns), all_vars(. >= 0))

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
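
For completeness: filter_at was itself superseded in later dplyr releases. A sketch of the current idiom, assuming dplyr >= 1.0.4 (which introduced if_all):

df %>%
  filter(if_all(all_of(target_columns), ~ .x >= 0))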
aosmith

Here is my ugly solution. Suggestions/criticisms welcome.

df %>% 
  # Select the columns we want
  select(matches("_num$")) %>%
  # Convert every column to logical if >= 0
  lapply(">=", 0) %>%
  # Reduce all the sublist with AND 
  Reduce(f = "&", .) %>%
  # Convert the one vector of logical into numeric
  # index since slice can't deal with logical. 
  # Can simply write `{df[.,]}` here instead,
  # which is probably faster than which + slice
  # Edit: This is not true. which + slice is faster than `[` in this case
  which %>%
  slice(.data = df)

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
Vlo

Using base R to get your result

cond <- df[, grepl("_num$", colnames(df))] >= 0
df[apply(cond, 1, function(x) {prod(x) == 1}), ]

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
4  4 leroy       0   az      25    old
5  5 jerry       4   mi      55    old

Edit: this assumes you have multiple "_num" columns. It won't work with just one _num column, because subsetting a single column drops the result to a vector, and apply then fails for lack of dimensions.
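
A sketch of one way to cover the single-column case, using drop = FALSE so the subset keeps its dimensions (with the more idiomatic all() in place of the prod() trick):

cond <- df[, grepl("_num$", colnames(df)), drop = FALSE] >= 0
df[apply(cond, 1, all), ]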

Whitebeard

First we create an index of all numeric columns. Then we keep only the rows in which every numeric value is greater than or equal to zero. There is no need to check the column names, and the id column is always positive anyway.

nums <- sapply(df, is.numeric)
df[apply(df[, nums], MARGIN = 1, function(x) all(x >= 0)), ]

Output:

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
4  4 leroy       0   az      25    old
5  5 jerry       4   mi      55    old
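
For reference, nums in the first step is a named logical vector marking which columns are numeric, roughly:

nums
#      id    sth1 tg1_num    sth2 tg2_num  others
#    TRUE   FALSE    TRUE   FALSE    TRUE   FALSE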
mpalanco

This will give you a vector of the row indices that contain values less than 0:

desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
desired_rows <- as.vector(unique(unlist(desired_rows)))

Then, to get a data frame of your desired rows (note that setdiff here is dplyr's data-frame method, since dplyr is attached):

setdiff(df, df[desired_rows,])
  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
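
If you'd rather not rely on dplyr's setdiff, a sketch of an equivalent base-R subset (which also works when desired_rows is empty):

df[!seq_len(nrow(df)) %in% desired_rows, ]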
Tchotchke
  • This looks like it will work. However, I would like to avoid using loops. My data is quite big and it would probably run very slowly. – breezymri Aug 19 '15 at 19:51
  • @Tchotchke Just out of interest, do you think one could also use `filter(...)` in your first line of code? – maj Aug 19 '15 at 20:05