Filter rows which contain a certain string

Question

I have to filter a data frame using as criterion those row in which is contained the string RTB.

I'm using dplyr.

d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

I know I can use the function filter in dplyr but I don't exactly how to tell it to check for the content of a string.

In particular I want to check the content in the column TrackingPixel. If the string contains the label RTB I want to remove the row from the result.

I have never used `dplyr`, but looking at the help in `?dplyr::filter` i'd suggest something like `filter(df, !grepl("RTB",TrackingPixel))` maybe? — thelatemail, Apr 03 '14 at 22:22
This is actually close to what I want to achieve. The only issue is that is maintaining those string which include the label `RTB` and not showing the others. — Gianluca, Apr 03 '14 at 22:29
I just put in a stealth edit, that is reversed now by adding the `!` in front of `grepl` - try it again. — thelatemail, Apr 03 '14 at 22:30
Or use the `invert` and `value` arguments of `grep`. Regular expressions make working with text a thousand times easier. — Rich Scriven, Apr 04 '14 at 02:37
@thelatemail `grepl` doesn't work on postgres for me, is this for MySQL? — Statwonk, Apr 20 '14 at 21:17

alex23lemm · Answer 1 · 2017-07-28T15:23:24.937

366

The answer to the question was already posted by the @latemail in the comments above. You can use regular expressions for the second and subsequent arguments of filter like this:

dplyr::filter(df, !grepl("RTB",TrackingPixel))

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))

edited Jul 28 '17 at 15:23

answered Jul 18 '14 at 09:03

alex23lemm

5,475
1
21
23

what if the column name contains a space. like Tracking Pixels. – MySchizoBuddy May 09 '15 at 23:19
4

make sure you are using the filter function from the dplyr package, not the stats package – JHowIX May 20 '15 at 21:58
2

@MySchizoBuddy: If the column name contains white space, you can select the variable using backticks. Modifying the example above: `mtcars$\`my type\` <- rownames(mtcars)` and then `mtcars %>% filter(grepl('Toyota|Mazda', \`my type\`))` – alex23lemm Jun 09 '15 at 21:35
13

note that this does not work when the object is a `tbl_sql` as `grepl` does not translate to sql. – David LeBauer Aug 11 '15 at 17:17
1

option 1 is to know for sure that dplyr loaded last. option 2 is you prefix dplyr::filter. – userJT Mar 01 '16 at 14:30

Keiku · Answer 2 · 2018-04-28T14:50:50.583

Solution

It is possible to use str_detect of the stringr package included in the tidyverse package. str_detect returns True or False as to whether the specified vector contains some specific string. It is possible to filter using this boolean value. See Introduction to stringr for details about stringr package.

library(tidyverse)
# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
# ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
# ✔ tibble  1.4.2     ✔ dplyr   0.7.4
# ✔ tidyr   0.7.2     ✔ stringr 1.2.0
# ✔ readr   1.1.1     ✔ forcats 0.3.0
# ─ Conflicts ───────────────────── tidyverse_conflicts() ─
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

mtcars$type <- rownames(mtcars)
mtcars %>%
  filter(str_detect(type, 'Toyota|Mazda'))
# mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
# 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
# 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

The good things about Stringr

We should use rather stringr::str_detect() than base::grepl(). This is because there are the following reasons.

The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
The first argument of the functions of stringr package is always the data.frame (or value), then comes the parameters.(Thank you Paolo)

object <- "stringr"
# The functions with the same prefix `str_`.
# The first argument is an object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# The function names without common points.
# The position of the argument of the object also does not match.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"

Benchmark

The results of the benchmark test are as follows. For large dataframe, str_detect is faster.

library(rbenchmark)
library(tidyverse)

# The data. Data expo 09. ASA Statistics Computing and Graphics 
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728      29

benchmark(
  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2      grepl           10  16.480    1.513    16.195    0.248
# 1 str_detect           10  10.891    1.000     9.594    1.281

@CameronNemo The functions provided by the `stringr` package start with the prefix str_, which makes the code easier to read. In recent modern R code, it is recommended to use stringr. — Keiku, Feb 25 '18 at 08:26
I think this is a very personal preference and I do agree with @CameronNemo that `base R` is as good as `stringr`. If you would provide us with some 'hard facts' such as benchmarking, and not only stating "it is recommended" (Who does recommend it?) , this would be highly appreciated. Thanks — tjebo, Apr 16 '18 at 13:20
An other reason is the consistency in the tidyverse framework: the first argument of a function is always the data.frame (or value), then comes the parameters. — Paolo, Apr 24 '18 at 10:21
Just stumbled into this post, I know it's an old discussion but just in case someone else gets here: one reason one may prefer `str_detect` over `grepl` is that if there's a missing value `str_detect` will return `NA`, while `grepl` will return `FALSE`, which can be misleading. — Claudio, Mar 08 '21 at 10:43
if I do without filter like: " mtcars %>% str_detect(type, 'Toyota|Mazda') " I receive Error in type(pattern) : object 'type' not found. Why do I not get an vector with TRUE FALSE ? — user2165379, Jun 16 '21 at 10:49
Add rowwise() is the solution to receive a vector with TRUE FALSE. See post below: https://stackoverflow.com/questions/26659198/detect-multiple-strings-with-dplyr-and-stringr — user2165379, Jun 16 '21 at 11:33
@Keiku Is it possible to use `str_detect` to directly detect strings in the rownames without adding these as a new column in the data frame? — Martin, Sep 15 '21 at 13:50

score 53 · Answer 3 · edited Feb 19 '19 at 22:06

This answer similar to others, but using preferred stringr::str_detect and dplyr rownames_to_column.

library(tidyverse)

mtcars %>% 
  rownames_to_column("type") %>% 
  filter(stringr::str_detect(type, 'Toyota|Mazda') )

#>             type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 4  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Created on 2018-06-26 by the reprex package (v0.2.0).

tjebo · Answer 4 · 2023-02-22T15:52:28.170

Here's another dplyr solution, using filter(if_all/if_any). The advantage is that you can easily extend to more than one column. Below showing how to filter for rows with a given string in any column, using diamonds as example, looking for the string "V".

edit to reflect changes in the dplyr syntax (>=dplyr vs. 1.0.10). Previously using across (now deprecated) and even before that filter_all or filter_any (superseded).

Removing rows where any column fulfils a condition

library(dplyr)

## with if_any
ggplot2::diamonds %>%
  ## NB ! needs to come before if_any
  filter(!if_any(everything(), ~ grepl('V', .))) %>%
  head()
#> # A tibble: 6 × 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
#> 4  0.3  Good    J     SI1      64      55   339  4.25  4.28  2.73
#> 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
#> 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71

## or with if_all
ggplot2::diamonds %>%
  filter(if_all(everything(), ~ !grepl('V', .))) %>%
  head()
#> # A tibble: 6 × 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
#> 4  0.3  Good    J     SI1      64      55   339  4.25  4.28  2.73
#> 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
#> 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71

Filtering for rows where any column fulfils a condition

## The new syntax makes it also easy to positively filter rows 
## where one columns fulfils a condition
ggplot2::diamonds %>%
  filter(if_any(everything(), ~ grepl('V',.))) %>%
  head()
#> # A tibble: 6 × 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 2  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#> 3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> 4  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#> 5  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#> 6  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

Do you know why `diamonds %>% filter(across(everything(), ~grepl('V', .)))` returns an empty tibble? I would think simply removing the `!` would return all rows with a V in any column? — Dylan Russell, Jul 29 '20 at 00:01
@DylanRussell sorry for the late answer. I thought this was obvious - using everything means that you're looking for rows that have "V" in ALL columns. — tjebo, Jan 25 '21 at 12:37
~ !grepl what is the use of ~ and ! could you please tell me ? — Daman deep, Jul 02 '21 at 04:30

tjebo · Answer 5 · 2022-11-23T16:43:51.440

A further option based on Akruns' suggestion - create a logical vector with rowSums and subset. Only with base R. This is in particular then useful and elegant, when the column contains exactly the value that we are looking for. (or if you can create a two-dimensional array with a simple conditional statement like below: df1 == "no_data")

## this is very easy when the expected value is EXACTLY the string
df1 <- structure(list(time = c("1:00", "2:00", "no_data", "3:00"), speed = c("30", "no_data", "no_data", "50"), wheels = c("no_data", "18", "no_data", "18")), .Names = c("time", "speed", "wheels"), class = "data.frame", row.names = c(NA, -4L))
df1[rowSums(df1 == "no_data") == 0, , drop = FALSE]
#>   time speed wheels
#> 4 3:00    50     18

## it's a bit more verbose when the expected value just CONTAINS the string
mtcars$type <- rownames(mtcars)
mtcars[rowSums(apply(mtcars, 2, \(x) grepl('Toyota|Mazda', x))) > 0, , drop = FALSE] |> head()
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#>                          type
#> Mazda RX4           Mazda RX4
#> Mazda RX4 Wag   Mazda RX4 Wag
#> Toyota Corolla Toyota Corolla
#> Toyota Corona   Toyota Corona

Filter rows which contain a certain string

5 Answers5

Linked

Related