1

I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.

This is related to this question but a distinct in at least two ways that I can see:

  1. I have no identifier column (besides the row number, I suppose)
  2. I need to subset based on the max across the current row being evaluated, not across a column

Here are attempts I've tried:

data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))
Zelbinian
  • 3,221
  • 5
  • 20
  • 23
  • 3
    `mtcars[rowSums(subset(mtcars, , select=cyl:drat) > 200) > 0,]` works for columns between (including) `cyl` and `drat` for values above 200. Works just as well for other inequalities. You can also use "all data" by just doing `rowSums(mtcars > 200) > 0`. If you want something specific to your data, please make this question reproducible. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Aug 07 '19 at 00:37

4 Answers4

3

Here's a way which will keep rows having value above threshold. For keeping values below threshold, just reverse the inequality in any -

data %>% 
  filter(apply(., 1, function(x) any(x > threshold)))

Actually, @r2evans has better answer in comments -

data %>%
  filter(rowSums(. > threshold) >= 1)
Shree
  • 10,835
  • 1
  • 14
  • 36
2

Couple more options that should scale pretty well:

library(dplyr)

# a more dplyr-y option 
iris %>%
      filter_all(any_vars(. > 5))

# or taking advantage of base functions
iris %>%
      filter(do.call(pmax, as.list(.))>5)
Andrew
  • 5,028
  • 2
  • 11
  • 21
0

Maybe there are better and more efficient ways, but these two functions should do what you need if I understood correctly. This solution assumes you have only numerical data.

  • You transpose the tibble (so you obtain a numerical matrix)
  • Then you use map to get the max or min by column (which is the max/min by row in the initial dataset).
  • You obtain the row index you are looking for
  • Finally, you can filter your dataset.

# Random Data -------------------------------------------------------------

data <- as.tibble(replicate(10, runif(20)))

# Threshold to be used -----------------------------------------------------

max_treshold = 0.9
min_treshold = 0.1

# Lesser_max --------------------------------------------------------------

lesser_max = function(data, max_treshold = 0.9) {
  index_max_list =
    data %>%
    t() %>%
    as.tibble() %>%
    map(max) %>%
    unname()

  index_max =
    index_max_list < max_treshold

  data[index_max,]
}

# Greater_min -------------------------------------------------------------

greater_min = function(data, min_treshold = 0.1) {
  index_min_list =
    data %>%
    t() %>%
    as.tibble() %>%
    map(min) %>%
    unname()

  index_min =
    index_min_list > min_treshold

  data[index_min,]
}

# Examples ----------------------------------------------------------------

data %>%
  lesser_max(max_treshold)

data %>%
  greater_min(min_treshold)

0

We can use base R methods

data[Reduce(`|`, lapply(data, `>`, threshold)),]`
akrun
  • 874,273
  • 37
  • 540
  • 662