
I want to filter out specific rows from a data set I got from the Project Gutenberg R package (gutenbergr). I want to select only the rows that contain a given word, but every row contains more than one word, so an exact comparison inside filter() will not work.

For example:

The title is: "The Little Vanities of Mrs. Whittaker: A Novel". I want to filter out all the rows that contain the word "novel", but I cannot figure out how.

library(dplyr)
library(gutenbergr)
library(tidytext)

gutenberg_full_data <- left_join(gutenberg_works(language == "en"), gutenberg_metadata, by = "gutenberg_id")

gutenberg_full_data <- left_join(gutenberg_full_data, gutenberg_subjects, by = "gutenberg_id")

gutenberg_full_data <- subset(gutenberg_full_data, select = -c(rights.x,has_text.x,language.y,gutenberg_bookshelf.x, gutenberg_bookshelf.y,rights.y, has_text.y,gutenberg_bookshelf.y, gutenberg_author_id.y, title.y, author.y))

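# Drop rows without an author; !is.na() is safe even when no NA is present (-which() would then drop every row)
gutenberg_full_data <- gutenberg_full_data[!is.na(gutenberg_full_data$author.x), ]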
novels <- gutenberg_full_data %>% filter(subject == "Drama")

original_books <- gutenberg_download((novels), meta_fields = "title")

original_books

tidy_books <- original_books %>%
  unnest_tokens(word, text)

This is the code I used to build the data frame with the "gutenbergr" package.

macropod

2 Answers


You are probably looking for something like the following. str_detect() returns TRUE for every string that contains the pattern you pass in.

stringr::str_detect(variable, "keyword")

Example that keeps only the rows containing a specific string:

library(stringr)


df <- df %>% filter(str_detect(column_that_contains_the_word, "the word"))
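As a self-contained sketch of the same idea, the snippet below builds a small toy data frame (the name `books` and the choice of rows are assumptions for illustration; the titles themselves appear elsewhere in this thread) and filters it with a case-insensitive pattern via `regex(ignore_case = TRUE)`.

```r
library(dplyr)
library(stringr)

# Toy data for illustration only; `books` is a hypothetical name
books <- tibble(
  title = c(
    "The Little Vanities of Mrs. Whittaker: A Novel",
    "Six Plays",
    "Life Is a Dream",
    "The Robbers"
  )
)

# Keep only the rows whose title contains "novel", ignoring case
novels_only <- books %>%
  filter(str_detect(title, regex("novel", ignore_case = TRUE)))

# Negate the test to drop those rows and keep everything else
without_novels <- books %>%
  filter(!str_detect(title, regex("novel", ignore_case = TRUE)))
```

Whether you want the version with or without `!` depends on whether the matching rows should be kept or removed.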

In your case, I assume you want to filter out the rows that contain the specific string and keep all the others:

library(stringr)


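# A single case-insensitive pattern covers "novel", "Novel" and "NOVEL";
# a vector of patterns would be matched element by element against the rows instead
original_books <- original_books %>% filter(!str_detect(title, regex("novel", ignore_case = TRUE)))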

Let us know if it worked.

  • The ! before the str_detect was wrong, but otherwise it worked great, thank you. And sorry if this post was a duplicate. – MisterCoder Jan 04 '22 at 11:53
  • The ! before it means that we exclude the matches when subsetting. Without it you keep only the rows matching the pattern/string and exclude everything else. Also, if it worked, please indicate that the answer solved your problem. – geometricfreedom Jan 04 '22 at 16:12
  • word = c("novel", "Novel", "NOVEL") novels <- gutenberg_full_data %>% filter(str_detect(title.x,word)) – MisterCoder Jan 04 '22 at 17:05
  • Would be great if you could mark my answer as the one that solved your question. – geometricfreedom Jan 04 '22 at 23:00
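
A note on passing a vector of words as the pattern, as in the comment above: str_detect() is vectorised over both the string and the pattern, so a length-3 pattern is paired with the rows element by element rather than tested as "any of these words" (depending on the stringr version, a length mismatch is either recycled or rejected with an error). A hedged sketch of one alternative is to collapse the words into a single alternation pattern; the vector `titles` below is toy data, and "A NOVEL Idea" is a made-up title for illustration.

```r
library(stringr)

# Toy titles; "A NOVEL Idea" is invented for this example
titles <- c("Six Plays", "A NOVEL Idea", "Life Is a Dream", "A Novel")

words <- c("novel", "Novel", "NOVEL")

# Collapse the words into one regular expression: "novel|Novel|NOVEL"
pattern <- paste(words, collapse = "|")

# One pattern, tested against every title
matches <- str_detect(titles, pattern)
```

The same `pattern` can be used inside filter(str_detect(title.x, pattern)).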

You can use grepl() from base R for this. grepl() returns TRUE if the pattern is found in the string and FALSE otherwise. Note that the match is case-sensitive by default.

text = "The Little Vanities of Mrs. Whittaker: A Novel"
word = "Novel"

> grepl(word, text)

[1] TRUE

Building your original_books data frame requires large downloads, so here is an example that searches for "Plays" in the title.x column of your novels data frame instead.

> novels %>% 
     mutate(contains_play = grepl("Plays", title.x))

# A tibble: 54 × 8
   gutenberg_id title.x          author.x      gutenberg_autho… language.x subject_type subject contains_play
          <int> <chr>            <chr>                    <int> <chr>      <chr>        <chr>   <lgl>        
 1         1308 A Florentine Tr… Wilde, Oscar               111 en         lcsh         Drama   FALSE        
 2         2270 Shakespeare's F… Shakespeare,…               65 en         lcsh         Drama   FALSE        
 3         2587 Life Is a Dream  Calderón de …              970 en         lcsh         Drama   FALSE        
 4         4970 There Are Crime… Strindberg, …             1609 en         lcsh         Drama   FALSE        
 5         5053 Plays by August… Strindberg, …             1609 en         lcsh         Drama   TRUE         
 6         5618 Six Plays        Darwin, Flor…             1814 en         lcsh         Drama   TRUE         
 7         6587 King Arthur's S… Dell, Floyd               2100 en         lcsh         Drama   TRUE         
 8         6782 The Robbers      Schiller, Fr…              289 en         lcsh         Drama   FALSE        
 9         6790 Demetrius: A Pl… Schiller, Fr…              289 en         lcsh         Drama   FALSE        
10         6793 The Bride of Me… Schiller, Fr…              289 en         lcsh         Drama   FALSE        
# … with 44 more rows

Note that grepl() accepts a character vector as its second argument and checks every element, so using rowwise() is not necessary. If it could only search within a single string, we would have to use rowwise().
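
A minimal sketch of that vectorised behaviour, using a toy character vector (`titles` is an assumed name; the titles come from this thread). It also shows `ignore.case = TRUE` for case-insensitive matching and a `\\b` word boundary for whole-word matching, here with `perl = TRUE`.

```r
# Toy vector of titles for illustration
titles <- c(
  "The Little Vanities of Mrs. Whittaker: A Novel",
  "Six Plays",
  "Life Is a Dream"
)

# One logical value per title, no rowwise() needed
has_novel <- grepl("novel", titles, ignore.case = TRUE)

# Base R subsetting with the logical vector
novel_titles <- titles[has_novel]

# Whole-word match: "Play" as a complete word does not match "Plays"
has_play_word <- grepl("\\bPlay\\b", titles, perl = TRUE)
```

Inside a dplyr pipeline the same call works directly, for example filter(grepl("novel", title.x, ignore.case = TRUE)).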

Harshvardhan