Filter multiple values on a string column in dplyr

Question

I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing?

Example: data.frame name = dat

days      name
88        Lynn
11        Tom
2         Chris
5         Lisa
22        Kyla
1         Tom
222       Lynn
2         Lynn

I'd like to filter out Tom and Lynn for example.
When I do:

target <- c("Tom", "Lynn")
filt <- filter(dat, name == target)

I get this error:

longer object length is not a multiple of shorter object length

BrodieG · Accepted Answer · 2014-09-03T15:01:36.720

You need %in% instead of ==:

library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target)  # equivalently, dat %>% filter(name %in% target)

Produces

  days name
1   88 Lynn
2   11  Tom
3    1  Tom
4  222 Lynn
5    2 Lynn

To understand why, consider what happens here:

dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Basically, we're recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:

 Lynn == Tom
  Tom == Lynn
Chris == Tom
 Lisa == Lynn
 ... continue repeating Tom and Lynn until end of data frame

In this case we don't get an error, because the sample you provide has 8 rows, an exact multiple of the length of target, so recycling works cleanly. I suspect your actual data frame has a number of rows that is not a multiple of 2; if the sample had had an odd number of rows, I would have gotten the same message as you (with base R's == it is reported as a warning). But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:

return TRUE for every value in an odd position that is equal to "Tom" or every value in an even position that is equal to "Lynn".

It so happens that the last value in your sample data frame is in an even position and equal to "Lynn", hence the one TRUE above.

To contrast, dat$name %in% target says:

for each value in dat$name, check that it exists in target.

Very different. Here is the result:

[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Note your problem has nothing to do with dplyr, just the misuse of ==.
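To make the difference concrete, here is a minimal base R sketch. It rebuilds `dat` from the sample in the question; the names `recycled`, `eq_result`, `in_result` and `odd_warning` are only illustrative.

```r
# Sample data from the question
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# What == effectively compares against: target repeated to the length of dat$name
recycled <- rep(target, length.out = nrow(dat))

# Element-by-element comparison, position 1 with position 1, and so on
eq_result <- dat$name == target
same_as_recycled <- identical(eq_result, dat$name == recycled)

# Membership test: is each name found anywhere in target?
in_result <- dat$name %in% target

# With 7 rows the lengths no longer divide evenly, so == signals a warning;
# tryCatch captures its message instead of printing it
odd_warning <- tryCatch(
  dat$name[1:7] == target,
  warning = function(w) conditionMessage(w)
)
```

Printing `eq_result` and `in_result` side by side shows that only `%in%` answers the question "is this name one of the targets?".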

Thanks for the explanation Brodie! Really appreciate this, clinician trying to figure out R! — Tom O, Sep 03 '14 at 15:30
@BrodieG and could you make target with pattern, not full string? — , Feb 04 '20 at 13:01
Not with `%in%`, but you can do `grepl("T[oi]m|lynne?", name)` and use whatever pattern you want there. — BrodieG, Feb 06 '20 at 03:29
@user9440895 check my [answer](https://stackoverflow.com/a/71026441/9550633) using `stringr`. — rubengavidia0x, Feb 07 '22 at 23:08
At best this statement: "Basically, we're recycling the two length target vector four times to match the length of dat$name. " is confusing, but I think its just wrong. There's no recycling going on. Underneath the hood, the `%in%` operator is just a match operation. — IRTFM, Jul 03 '22 at 19:20

score 14 · Answer 2 · edited Dec 11 '19 at 10:13

This can be achieved using the dplyr package, which is available on CRAN. The simple way to achieve this:

Install the dplyr package.
Run the code below:

library(dplyr)

df <- select(filter(dat, name == 'Tom' | name == 'Lynn'), c('days', 'name'))

Explanation:

So, once we’ve downloaded dplyr, we create a new data frame by using two different functions from this package:

filter: the first argument is the data frame; the second argument is the condition by which we want it subsetted. The result is the entire data frame with only the rows we wanted.

select: the first argument is the data frame; the second argument is the names of the columns we want selected from it. We don’t have to use the names() function, and quotation marks are optional: the column names can also be listed bare, as objects.
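For what it's worth, the same filter-then-select steps can also be written as a pipeline with bare column names. This is only a sketch: `dat` is rebuilt here from the sample in the question, and `result` is just an illustrative name.

```r
library(dplyr)

# Sample data from the question
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)

result <- dat %>%
  filter(name == "Tom" | name == "Lynn") %>%  # keep rows matching either name
  select(days, name)                          # bare column names, no quotes
```

Chaining `==` with `|` stays readable for two names; for a longer list, `name %in% c(...)` avoids repeating the column name.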

mpalanco · Answer 3 · 2015-06-24T10:10:34.480

Using the base package:

df <- data.frame(days = c(88, 11, 2, 5, 22, 1, 222, 2), name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"))

# Three lines
target <- c("Tom", "Lynn")
index <- df$name %in% target
df[index, ]

# One line
df[df$name %in% c("Tom", "Lynn"), ]

Output:

  days name
1   88 Lynn
2   11  Tom
6    1  Tom
7  222 Lynn
8    2 Lynn
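The question says "filter out", which could also be read as dropping those rows rather than keeping them. Assuming that reading, a minimal base R sketch is to negate the same `%in%` test (`kept` and `dropped` are illustrative names):

```r
df <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# Rows whose name is in target
kept <- df[df$name %in% target, ]

# Rows whose name is NOT in target: negate the logical index
dropped <- df[!(df$name %in% target), ]
```

The parentheses around `df$name %in% target` are not strictly required, but they make it obvious that the negation applies to the whole membership test.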

Using sqldf:

library(sqldf)
# Two alternatives:
sqldf('SELECT *
      FROM df 
      WHERE name = "Tom" OR name = "Lynn"')
sqldf('SELECT *
      FROM df 
      WHERE name IN ("Tom", "Lynn")')

score 2 · Answer 4 · answered Mar 07 '22 at 17:47

Use %in% inside filter(). General pattern:

library(dplyr)

target <- YourData %>% filter(YourColumn %in% c("variable1", "variable2"))

Example with your data:

target <- df %>% filter(name %in% c("Tom", "Lynn"))

Rafa Mesa

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 08 '22 at 05:14

score 1 · Answer 5 · answered May 16 '20 at 02:24

by_type_year_tag_filtered <- by_type_year_tag %>%
  dplyr::filter(tag_name %in% c("dplyr", "ggplot2"))

Hanif

While this code may provide a solution to problem, it is highly recommended that you provide additional context regarding why and/or how this code answers the question. Code only answers typically become useless in the long-run because future viewers experiencing similar problems cannot understand the reasoning behind the solution. – palaѕн May 16 '20 at 04:36

score 0 · Answer 6 · answered Feb 07 '22 at 22:53

If the values in your string column are long strings, you can match on part of each value with the stringr package. This is something that exact matching with filter( %in% ) can't do, because %in% only compares whole strings.

library(dplyr)
library(stringr)

sentences_tb = as_tibble(sentences) %>%
                 mutate(row_number())
sentences_tb
# A tibble: 720 x 2
   value                                       `row_number()`
   <chr>                                                <int>
 1 The birch canoe slid on the smooth planks.               1
 2 Glue the sheet to the dark blue background.              2
 3 It's easy to tell the depth of a well.                   3
 4 These days a chicken leg is a rare dish.                 4
 5 Rice is often served in round bowls.                     5
 6 The juice of lemons makes fine punch.                    6
 7 The box was thrown beside the parked truck.              7
 8 The hogs were fed chopped corn and garbage.              8
 9 Four hours of steady work faced us.                      9
10 Large size in stockings is hard to sell.                10
# ... with 710 more rows                

matching_letters <- c(
  "canoe","dark","often","juice","hogs","hours","size"
)
matching_letters <- str_c(matching_letters, collapse = "|")
matching_letters
[1] "canoe|dark|often|juice|hogs|hours|size"

letters_found <- str_subset(sentences_tb$value,matching_letters)
letters_found_tb = as_tibble(letters_found)
inner_join(sentences_tb,letters_found_tb)

# A tibble: 16 x 2
   value                                          `row_number()`
   <chr>                                                   <int>
 1 The birch canoe slid on the smooth planks.                  1
 2 Glue the sheet to the dark blue background.                 2
 3 Rice is often served in round bowls.                        5
 4 The juice of lemons makes fine punch.                       6
 5 The hogs were fed chopped corn and garbage.                 8
 6 Four hours of steady work faced us.                         9
 7 Large size in stockings is hard to sell.                   10
 8 Note closely the size of the gas tank.                     33
 9 The bark of the pine tree was shiny and dark.             111
10 Both brothers wear the same size.                         253
11 The dark pot hung in the front closet.                    261
12 Grape juice and water mix well.                           383
13 The wall phone rang loud and often.                       454
14 The bright lanterns were gay on the dark lawn.            476
15 The pleasant hours fly by much too soon.                  516
16 A six comes up more often than a ten.                     609

It's a bit verbose, but it's very handy and powerful if you have long strings and want to find which rows contain a specific word.
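A shorter route to the same kind of result, sketched on the assumption that you only need the matching rows, is to put `str_detect()` directly inside `filter()`. The names `pattern`, `matches` and the `row` column are illustrative and not part of the code above.

```r
library(dplyr)
library(stringr)

# stringr ships the `sentences` character vector used above
sentences_tb <- tibble(value = sentences) %>%
  mutate(row = row_number())

# One regular expression that matches any of the words
pattern <- str_c(
  c("canoe", "dark", "often", "juice", "hogs", "hours", "size"),
  collapse = "|"
)

# Keep rows whose sentence contains at least one of the words
matches <- sentences_tb %>%
  filter(str_detect(value, pattern))
```

Note that the pattern matches substrings, so "dark" would also match inside a longer word; wrapping each word in `\\b` word boundaries would restrict it to whole words.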

Comparing with the other answers:

> target <- c("canoe","dark","often","juice","hogs","hours","size")
> filter(sentences_tb, value %in% target)
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

> df<- select(filter(sentences_tb,value=='canoe'| value=='dark'), c('value','row_number()'))
> df
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

> target <- c("canoe","dark","often","juice","hogs","hours","size")
> index <- sentences_tb$value %in% target
> sentences_tb[index, ]
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>

With those approaches, you would need to write out each full sentence to get the desired result.

score 0 · Answer 7 · answered May 05 '23 at 09:09

Another option is to use slice() together with which(), which returns the row indexes of the values you want to keep. Here is some reproducible code:

library(dplyr)
df %>%
  slice(which(name %in% c("Tom", "Lynn")))
#>   days name
#> 1   88 Lynn
#> 2   11  Tom
#> 3    1  Tom
#> 4  222 Lynn
#> 5    2 Lynn

Created on 2023-05-05 with reprex v2.0.2

Data used:

df = read.table(text = "days      name
88        Lynn
11        Tom
2         Chris
5         Lisa
22        Kyla
1         Tom
222       Lynn
2         Lynn", header = TRUE)
