R Extract duplicate words in string

Question

My data consists of two string columns, a and b. My goal is to obtain a new variable containing the words that are repeated in both.

    a = c("the red house av", "the blue sky", "the green grass")
    b = c("the house built", " the sky of the city", "the grass in the garden")

    data = data.frame(a, b)

Based on this answer, I can get a logical vector flagging the repeated words with duplicated():

    library(dplyr)

    data = data %>% mutate(c = paste(a, b, sep = " "),
                           d = vapply(lapply(strsplit(c, " "), duplicated), paste, character(1L), collapse = " "))

However, I am not able to obtain the words themselves. My desired output would look something like this:

> data.1
                 a                       b         d
1 the red house av         the house built the house
2     the blue sky     the sky of the city   the sky
3  the green grass the grass in the garden the grass

Any help on the function above would be highly appreciated.

You're just about there. Your current `d` is a logical vector of TRUE for duplicated words and FALSE for non-duplicated; just use it to subset `c`. E.g. change the `duplicated` to `function(x) x[duplicated(x)]`. — mathematical.coffee, Sep 27 '16 at 05:33
Many thanks @mathematical.coffee. It would be like: `data = data%>% mutate(c = paste(a,b, sep = " "), d = vapply(lapply(strsplit(c, " "), function (x) x[duplicated(x)]), paste, character(1L), collapse = " "))` — Edu, Sep 27 '16 at 05:47
@Edu careful with that. Your function is tokenizing the words after they have been pasted together, which means it can't distinguish if the word came from `a` or from `b`. See what happens if you first update your data and then run that function: `data$a[1] <- "the red house av av"`. "av" appears as duplicated even though it doesn't appear in `b`. — Chrisss, Sep 27 '16 at 06:01
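
To make the caveat in the last comment concrete, here is a minimal base R sketch comparing the two strategies on the modified first row. The helper names `dup_after_paste` and `common_words` are made up for illustration; they do not appear in the question or the comments.

```r
# Hypothetical helpers, named here only for illustration.

# Approach from the comments: paste a and b together, then keep duplicated tokens.
dup_after_paste <- function(a, b) {
  tokens <- strsplit(paste(a, b, sep = " "), " ")[[1]]
  paste(tokens[duplicated(tokens)], collapse = " ")
}

# Alternative: tokenize a and b separately and keep the words found in both.
common_words <- function(a, b) {
  paste(intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]]), collapse = " ")
}

a1 <- "the red house av av"   # "av" is repeated within a only
b1 <- "the house built"

dup_after_paste(a1, b1)  # "av the house": "av" shows up although it never occurs in b1
common_words(a1, b1)     # "the house": only words present in both strings
```

The second helper never sees the pasted string, so a word repeated inside `a` alone cannot be mistaken for a word shared with `b`.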

score 5 · Accepted Answer · answered Sep 27 '16 at 05:32

a = c("the red house av", "the blue sky", "the green grass")
b = c("the house built", " the sky of the city", "the grass in the garden")

data <- data.frame(a, b, stringsAsFactors = FALSE)

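# Return the single-row data frame with a new column c holding the words shared by a and b
func <- function(dta) {
    # split both strings on spaces and keep the words present in each
    words <- intersect( unlist(strsplit(dta$a, " ")), unlist(strsplit(dta$b, " ")) )
    dta$c <- paste(words, collapse = " ")
    return( as.data.frame(dta, stringsAsFactors = FALSE) )
}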

library(dplyr)
data %>% rowwise() %>% do( func(.) )

Result:

#Source: local data frame [3 x 3]
#Groups: <by row>
#
## A tibble: 3 x 3
#                 a                       b         c
#*            <chr>                   <chr>     <chr>
#1 the red house av         the house built the house
#2     the blue sky     the sky of the city   the sky
#3  the green grass the grass in the garden the grass
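
If you would rather not go through `rowwise()` and `do()`, the same per-row intersection can probably be sketched with `mapply()` in base R. This is an alternative, not the code above: `shared_words` is a hypothetical name, and the `trimws()` plus `"\\s+"` handling is an assumption added so that the leading space in the second element of `b` does not produce an empty token (`trimws()` needs R 3.2.0 or later).

```r
a <- c("the red house av", "the blue sky", "the green grass")
b <- c("the house built", " the sky of the city", "the grass in the garden")
data <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper: words that occur in both x and y.
shared_words <- function(x, y) {
  wx <- strsplit(trimws(x), "\\s+")[[1]]  # trim, then split on runs of whitespace
  wy <- strsplit(trimws(y), "\\s+")[[1]]
  paste(intersect(wx, wy), collapse = " ")
}

# Apply the helper to each (a, b) pair; USE.NAMES = FALSE keeps the result unnamed.
data$c <- mapply(shared_words, data$a, data$b, USE.NAMES = FALSE)
data$c
```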

score 1 · Answer 2 · answered Sep 27 '16 at 08:48

Here is another attempt using base R (no package needed):

df$c <- apply(df,1,function(x) 
               paste(Reduce(intersect, strsplit(x, " ")), collapse = " "))

#                  a                       b         c
# 1 the red house av         the house built the house
# 2     the blue sky     the sky of the city   the sky
# 3  the green grass the grass in the garden the grass

Data:

df <- structure(list(a = c("the red house av", "the blue sky", "the green grass"
), b = c("the house built", " the sky of the city", "the grass in the garden"
)), .Names = c("a", "b"), row.names = c(NA, -3L), class = "data.frame")
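
Because `Reduce(intersect, ...)` folds `intersect` over however many columns a row has, the same expression should extend beyond two columns. A small sketch, where `df3`, its third column `d`, and the column name `common` are invented purely for illustration:

```r
# df3 and its column d are made-up example data, not from the question.
df3 <- data.frame(
  a = c("the red house av", "the blue sky"),
  b = c("the house built", "the sky of the city"),
  d = c("a house on the hill", "sky above the sea"),
  stringsAsFactors = FALSE
)

# Same expression as above, now intersecting the words of three columns per row.
df3$common <- apply(df3, 1, function(x)
  paste(Reduce(intersect, strsplit(x, " ")), collapse = " "))

df3$common
```

Only words present in every column of a row survive, so adding columns can only shrink the result.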
