
I have a list of sentences and a list of words, and I want to update each sentence to keep only the words that appear in the word list.

For example, I have the following words:

"USA","UK","Germany","Australia","Italy","in","to"

and the following sentences:

"I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English"

I want to remove all words in the sentences that do not exist in the word list, so the expected output is the following sentences: "in Germany", "Italy to USA", "in USA UK Australia".

How can I do that using the apply functions?

mywords=data.frame(words=c("USA","UK","Germany","Australia","Italy","in","to"),
                   stringsAsFactors = F)
mysentences=data.frame(sentences=c("I lived in Germany 2 years",
                                   "I moved from Italy to USA",
                                   "people in USA, UK and Australia speak English"),
                   stringsAsFactors = F)
neilfws
Nobel
    I misread this first time around; there's a very similar question with accepted answer here - http://stackoverflow.com/questions/28891130/only-keep-words-in-data-frame-that-are-found-in-vector-r – neilfws May 03 '17 at 02:03
    @neilfws - that can be adapted pretty easily - `sapply(strsplit(sentence, "[[:space:]|[:punct:]]"), intersect, vect)` for instance. – thelatemail May 03 '17 at 02:55
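The adapted one-liner from the comment above can be sketched in base R (a sketch; the names `vect` and `sentence` follow the comment, and note that `intersect()` also deduplicates repeated words, which happens to be harmless for this data):

```r
vect <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
sentence <- c("I lived in Germany 2 years",
              "I moved from Italy to USA",
              "people in USA, UK and Australia speak English")

# Split each sentence on whitespace or punctuation, then keep only
# the words found in vect; intersect() preserves each sentence's word order
kept <- sapply(strsplit(sentence, "[[:space:]|[:punct:]]"), intersect, vect)
out <- sapply(kept, paste, collapse = " ")
out
# [1] "in Germany"          "Italy to USA"        "in USA UK Australia"
```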

5 Answers


You can use a join to find the matching words, if you convert this text to a tidy data format. Then you can use purrr::map_chr() to get back to the strings you need.

library(tidyverse)
library(tidytext)

mywords <- data_frame(word = c("USA","UK","Germany","Australia","Italy","in","to"))

mysentences <- data_frame(sentences = c("I lived in Germany 2 years",
                                        "I moved from Italy to USA",
                                        "people in USA, UK and Australia speak English"))

mysentences %>% 
    mutate(id = row_number()) %>% 
    unnest_tokens(word, sentences, to_lower = FALSE) %>% 
    inner_join(mywords) %>% 
    nest(-id) %>%
    mutate(sentences = map(data, unlist),
           sentences = map_chr(sentences, paste, collapse = " ")) %>%
    select(-data)

#> Joining, by = "word"
#> # A tibble: 3 × 2
#>      id           sentences
#>   <int>               <chr>
#> 1     1          in Germany
#> 2     2        Italy to USA
#> 3     3 in USA UK Australia
Julia Silge

Here are two approaches. The first collapses the word list into a regex and then uses str_detect to match words with the regex:


library(tidyverse)
library(glue)

mywords=data_frame(words=c("USA","UK","Germany","Australia","Italy","in","to"))
mysentences=data_frame(sentences=c("This is a sentence with no words of word list",
                                   "I lived in Germany 2 years",
                                   "I moved from Italy to USA",
                                   "people in USA, UK and Australia speak English"))
mysentences %>% 
  filter(sentences %>% 
           str_detect(mywords$words %>% collapse(sep = "|") %>% regex(ignore_case = T)))
#> # A tibble: 3 × 1
#>                                       sentences
#>                                           <chr>
#> 1                    I lived in Germany 2 years
#> 2                     I moved from Italy to USA
#> 3 people in USA, UK and Australia speak English

The second approach uses fuzzyjoin's `regex_semi_join`, which uses `str_detect` behind the scenes and does the above work for you:

library(fuzzyjoin)
mysentences %>%
  regex_semi_join(mywords, by= c(sentences = "words"))
#> # A tibble: 3 × 1
#>                                       sentences
#>                                           <chr>
#> 1                    I lived in Germany 2 years
#> 2                     I moved from Italy to USA
#> 3 people in USA, UK and Australia speak English
yeedle

You can use stringr as well:

vect <- c("USA","UK","Germany","Australia","Italy","in","to")
sentence <- c("I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English")

library(stringr)
li <- str_extract_all(sentence,paste0(vect,collapse="|"))
d <- list()
for(i in 1:length(li)){
  d[i] <- paste(li[[i]],collapse=" ")
}

unlist(d)

Output:

 > unlist(d)
[1] "in Germany"         
[2] "Italy to USA"       
[3] "in USA UK Australia"
PKumar

This is suitable for shorter lists of words:

library(stringr)
mywords_regex <- paste0(mywords$word, collapse = "|")
sapply(str_extract_all(mysentences$sentences, mywords_regex), paste, collapse = " ")

[1] "in Germany"          "Italy to USA"        "in USA UK Australia"
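One caveat with this alternation regex (a side note, not part of the original answer): short words like "in" or "to" will also match inside longer words. If that matters for your data, wrapping the pattern in word boundaries restricts matching to whole words; the sentence below is a made-up example:

```r
library(stringr)

words <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
s <- "I am interested to move into town"

# Plain alternation also matches inside "interested", "into", "town"
plain <- paste0(words, collapse = "|")
str_extract_all(s, plain)[[1]]

# \b word boundaries keep only standalone words
bounded <- paste0("\\b(", plain, ")\\b")
str_extract_all(s, bounded)[[1]]
# [1] "to"
```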
Andrew Lavers

Thanks, all.

I solved it with the following code, which was inspired by this answer, using the `intersect` function:

vect <- data.frame( c("USA","UK","Germany","Australia","Italy","in","to"),stringsAsFactors = F)
sentence <- data.frame(c("I lived in Germany 2 years", "I moved from Italy to USA",
                         "people in USA     UK and    Australia speak English"),stringsAsFactors = F)

sentence[,1] <- gsub("[^[:alnum:] ]", "", sentence[,1]) # remove special characters
sentence[,1] <- sapply(sentence[,1], function(x) {
  paste(intersect(strsplit(x, "\\s")[[1]], vect[,1]), collapse = " ")
})
Nobel