In R, how can I filter based on the maximum value in each row of my data?

Question

I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.

This is related to this question but a distinct in at least two ways that I can see:

I have no identifier column (besides the row number, I suppose)
I need to subset based on the max across the current row being evaluated, not across a column

Here are attempts I've tried:

data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))

`mtcars[rowSums(subset(mtcars, , select=cyl:drat) > 200) > 0,]` works for columns between (including) `cyl` and `drat` for values above 200. Works just as well for other inequalities. You can also use "all data" by just doing `rowSums(mtcars > 200) > 0`. If you want something specific to your data, please make this question reproducible. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. — r2evans, Aug 07 '19 at 00:37

Shree · Accepted Answer · 2019-08-07T00:52:03.160

3

Here's a way which will keep rows having value above threshold. For keeping values below threshold, just reverse the inequality in any -

data %>% 
  filter(apply(., 1, function(x) any(x > threshold)))

Actually, @r2evans has better answer in comments -

data %>%
  filter(rowSums(. > threshold) >= 1)

edited Aug 07 '19 at 00:52

answered Aug 07 '19 at 00:38

Shree

10,835
1
14
36

score 2 · Answer 2 · answered Aug 07 '19 at 00:54

2

Couple more options that should scale pretty well:

library(dplyr)

# a more dplyr-y option 
iris %>%
      filter_all(any_vars(. > 5))

# or taking advantage of base functions
iris %>%
      filter(do.call(pmax, as.list(.))>5)

answered Aug 07 '19 at 00:54

Andrew

5,028
2
11
21

1

Wasn't aware of `any_vars`. Nice! +1 – Shree Aug 07 '19 at 01:13

score 0 · Answer 3 · answered Aug 07 '19 at 01:10

Maybe there are better and more efficient ways, but these two functions should do what you need if I understood correctly. This solution assumes you have only numerical data.

You transpose the tibble (so you obtain a numerical matrix)
Then you use map to get the max or min by column (which is the max/min by row in the initial dataset).
You obtain the row index you are looking for
Finally, you can filter your dataset.


# Random Data -------------------------------------------------------------

data <- as.tibble(replicate(10, runif(20)))

# Threshold to be used -----------------------------------------------------

max_treshold = 0.9
min_treshold = 0.1

# Lesser_max --------------------------------------------------------------

lesser_max = function(data, max_treshold = 0.9) {
  index_max_list =
    data %>%
    t() %>%
    as.tibble() %>%
    map(max) %>%
    unname()

  index_max =
    index_max_list < max_treshold

  data[index_max,]
}

# Greater_min -------------------------------------------------------------

greater_min = function(data, min_treshold = 0.1) {
  index_min_list =
    data %>%
    t() %>%
    as.tibble() %>%
    map(min) %>%
    unname()

  index_min =
    index_min_list > min_treshold

  data[index_min,]
}

# Examples ----------------------------------------------------------------

data %>%
  lesser_max(max_treshold)

data %>%
  greater_min(min_treshold)

score 0 · Answer 4 · answered Aug 07 '19 at 02:52

0

We can use base R methods

data[Reduce(`|`, lapply(data, `>`, threshold)),]`

answered Aug 07 '19 at 02:52

akrun

874,273
37
540
662

In R, how can I filter based on the maximum value in each row of my data?

4 Answers4

Linked