2

Say I have a massive dataframe and in multiple columns I have an extremely large list of unique codes and I want to use these codes to select certain rows to subset the original dataframe. There are around 1000 codes and the codes I want all follow after each other. For example I have about 30 columns that contain codes and I only want to take rows that have codes 100 to 120 in ANY of these columns .

There's a long way to do this which is something like

new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100 

and I repeat this for every single possible code for everyone of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?

I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box

Take the iris dataset

Say I wanted all rows that contain the value 4.0-5.0 in any columns that contains the word Sepal in the column name.

#this only goes for 4.0

brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))

but what I want is something like

brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))

Is there a dplyr way to do this?

Jin
  • 527
  • 6
  • 21

5 Answers5

3

A corresponding across() version from @RonakShah's answer:

library(dplyr)

iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)

or

iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)

From vignette("colwise"):

Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().

So you need something like rowSums(...) > 0 to achieve the effect of any_vars().

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
2

You can use filter_at :

library(dplyr)
iris %>%  filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))

#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#1           4.9         3.0          1.4         0.2     setosa
#2           4.7         3.2          1.3         0.2     setosa
#3           4.6         3.1          1.5         0.2     setosa
#4           5.0         3.6          1.4         0.2     setosa
#5           4.6         3.4          1.4         0.3     setosa
#6           5.0         3.4          1.5         0.2     setosa
#7           4.4         2.9          1.4         0.2     setosa
#....
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Cool! Say I wanted to also add a jump for codes I select like not just 4 to 5 but also include 5.5 to 6. How would I do that? Also I tried method where it just uses the `any_vars` function with just `new_dat = iris %>% filter_at(vars(contains('Sepal')), any_vars(4))`, but this returns the entire dataframe and not rows that just have 4 in the selected columns. is there something im doing wrong – Jin Jun 13 '20 at 05:41
  • @762 Do you mean `iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5) | between(., 5.5, 6)))` ? – Ronak Shah Jun 13 '20 at 05:45
  • Yes exactly! Can this be applied in other ways besides the between method? Like just `any_vars(4 | 4.1)` to get values that have either 4 or 4.1? I tried that but just had the full dataframe returned back to me. – Jin Jun 13 '20 at 05:49
  • Yes, you can do `iris %>% filter_at(vars(contains('Sepal')), any_vars(. %in% c(4, 4.1)))` and add all the numbers that you want in `c()` but remember that floating point comparisons are not accurate. Read https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal – Ronak Shah Jun 13 '20 at 05:55
1

Base R:

# Subset: 
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
                   function(x) {
                     df2[, x] >= 100 & df2[, x] <= 120
                   })) == length(cols), ]
# Data: 
tmp <- data.frame(x1 <- rnorm(999, mean = 100, sd = 2))
df <-
  setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
hello_friend
  • 5,682
  • 1
  • 11
  • 15
1

One option could be:

iris %>%
 filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           4.9         3.0          1.4         0.2       1
2           4.7         3.2          1.3         0.2       1
3           4.6         3.1          1.5         0.2       1
4           5.0         3.6          1.4         0.2       1
5           4.6         3.4          1.4         0.3       1
6           5.0         3.4          1.5         0.2       1
7           4.4         2.9          1.4         0.2       1
8           4.9         3.1          1.5         0.1       1
9           4.8         3.4          1.6         0.2       1
10          4.8         3.0          1.4         0.1       1
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
0
library(dplyr)
df <- iris
# value to look for
val <- 4 
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>%  filter_at(cols, any_vars(.==val))
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.8         4.0          1.2         0.2     setosa
2          5.5         2.3          4.0         1.3 versicolor
3          6.0         2.2          4.0         1.0 versicolor
4          6.1         2.8          4.0         1.3 versicolor
5          5.5         2.5          4.0         1.3 versicolor
6          5.8         2.6          4.0         1.2 versicolor
Waldi
  • 39,242
  • 6
  • 30
  • 78