Is it possible to remove duplicate sentences / blocks of text in R?

Question

I was wondering whether it is possible to remove duplicate sentences, or even duplicated blocks of text (a repeated set of sentences), from a data frame in R. In my specific case, imagine I have saved the posts of a forum but did not mark where a person quoted an earlier post, and I now want to remove all quotes from the cells containing the posts. Thanks for any tips or hints.

An example could look something like this, where `duplicateposts` is the input and `postsnoduplicates` is the desired result:

    names <- c("Richard", "Mortimer", "Elizabeth", "Jeremiah")
    posts <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out. That sounds quite aggressive. How about just talking to them in a friendly way, first?", "That sounds quite aggressive. How about just talking to them in a friendly way, first? Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")

    duplicateposts <- data.frame(names, posts)

    posts2 <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "That sounds quite aggressive. How about just talking to them in a friendly way, first?", "Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")

    postsnoduplicates <- data.frame(names, posts2)

While imagining is fun, could you please provide a minimal reproducible example of the data you are using, and an expected result? See - https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - for some guidelines. — thelatemail, Aug 29 '19 at 21:37

score 3 · Accepted Answer · answered Aug 29 '19 at 22:37

I think you need to `strsplit` at sentence ends, find the duplicated sentences, then paste the remaining ones back together. Something like:

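# Split each post after sentence-ending punctuation, keeping the punctuation
spl <- strsplit(as.character(duplicateposts$posts), "(?<=[.?!])(?=.)", perl=TRUE)
spl <- lapply(spl, trimws)
# Stack into one row per sentence: `values` = sentence, `ind` = author
spl <- stack(setNames(spl, duplicateposts$names))
# Keep the first occurrence of each sentence, then rebuild one post per author
aggregate(values ~ ind, data=spl[!duplicated(spl$values),], FUN=paste, collapse=" ")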

Resulting in:

#        ind                                                                                                                                              values
#1   Richard I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift
#2  Mortimer Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.
#3 Elizabeth                                                              That sounds quite aggressive. How about just talking to them in a friendly way, first?
#4  Jeremiah                                                   Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense
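
To see what the split step produces on its own, here is a minimal sketch on a single made-up string (the value of `x` is illustrative and not taken from the question's data). The lookbehind `(?<=[.?!])` splits after sentence-ending punctuation, so the punctuation stays attached to its sentence, and the lookahead `(?=.)` avoids a split at the very end of the string:

```r
# Illustrative input only; not part of the original data
x <- "That sounds quite aggressive. How about talking first? Fine by me"

# Split after '.', '?' or '!' whenever at least one more character follows
parts <- strsplit(x, "(?<=[.?!])(?=.)", perl = TRUE)[[1]]

# Every piece after the first keeps its leading space, so trim it
parts <- trimws(parts)
print(parts)
```

Note that this treats every `.`, `?` or `!` as a sentence end, so abbreviations such as "e.g." would also be split, and any sentence that happens to repeat (not only quotes) is dropped by the `duplicated` step.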

kstew · Answer 2 · 2019-08-29T22:52:56.473

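Here is a somewhat imperfect solution using the example data. The logic is to split each person's post into separate sentences (delimited by `?` or `.`), then remove duplicated sentences. The order of posts/names is important, so I have created an order variable. Note that the split drops the punctuation and the sentences are rejoined with `'. '`, so question marks and the final full stop are lost in the result.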

library(dplyr); library(tidyr); library(stringr)

names <- c("Richard", "Mortimer", "Elizabeth", "Jeremiah")
posts <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out. That sounds quite aggressive. How about just talking to them in a friendly way, first?", "That sounds quite aggressive. How about just talking to them in a friendly way, first? Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")
dp1 <- data.frame(names, posts)

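# Record the original post order as an integer so sorting stays numeric
dp1 <- dp1 %>% mutate(order = row_number())

# Split each post on '.' or '?', reshape to one row per sentence, drop empties
dp1 <- cbind(dp1, str_split(dp1$posts, '\\.|\\?', simplify = TRUE)) %>%
  gather(k, v, -order, -names, -posts) %>% filter(v != '') %>%
  mutate(v = str_trim(v))

# Keep the earliest occurrence of each sentence, then rebuild one post per name
dp1 %>% arrange(order) %>% group_by(v) %>% slice(1) %>% arrange(order, k) %>%
  group_by(names) %>% summarise(post2 = paste0(v, collapse = '. '))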

# A tibble: 4 x 2
  names     post2                                                                                              
  <fct>     <chr>                                                                                              
1 Elizabeth That sounds quite aggressive. How about just talking to them in a friendly way, first              
2 Jeremiah  Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense  
3 Mortimer  Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't sh~
4 Richard   I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sunday~

Thanks for your answer. But R doesn't find `gather`, and I can't find it in the documentation either. Do you know which package I need for that? — psyph, Aug 29 '19 at 22:49
Sorry, you need the `dplyr`, `tidyr`, and `stringr` packages. I will edit my post. — kstew, Aug 29 '19 at 22:52
Now that I think about it, do you know how I could restrict this so that it only runs within a certain thread, given by a third variable called `posttitle`, which in this case would be "garden"? That way, in a large data frame, it wouldn't cut out sentences from other forums that are duplicates just by chance. — psyph, Aug 29 '19 at 23:17
If you have another grouping variable `posttitle`, then you can add it to the second-to-last `group_by` command: `group_by(posttitle, v)` — kstew, Aug 30 '19 at 04:31
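
The idea of deduplicating only within a thread can also be sketched in base R, reusing the split pattern from the accepted answer rather than the dplyr pipeline above. This is a minimal sketch under assumptions: the `forum` data frame, its `posttitle` column, and the author "Anna" are made up for illustration, and a sentence is treated as a quote only if it already appeared earlier in the same `posttitle`.

```r
# Hypothetical example data; `posttitle` is the assumed grouping column
forum <- data.frame(
  posttitle = c("garden", "garden", "kitchen"),
  names     = c("Mortimer", "Elizabeth", "Anna"),
  posts     = c("Don't shy away.",
                "Don't shy away. That sounds quite aggressive.",
                "Don't shy away."),
  stringsAsFactors = FALSE
)

# Split every post into trimmed sentences, keeping the punctuation
sentences <- lapply(
  strsplit(forum$posts, "(?<=[.?!])(?=.)", perl = TRUE),
  trimws
)

# One row per sentence, remembering which post (row of `forum`) it came from
long <- data.frame(
  row      = rep(seq_along(sentences), lengths(sentences)),
  sentence = unlist(sentences),
  stringsAsFactors = FALSE
)
long$posttitle <- forum$posttitle[long$row]

# A sentence is a duplicate only if it appeared earlier in the same thread
keep <- !duplicated(long[c("posttitle", "sentence")])

# Paste the surviving sentences back together, one string per post
forum$posts2 <- vapply(
  seq_len(nrow(forum)),
  function(i) paste(long$sentence[keep & long$row == i], collapse = " "),
  character(1)
)
print(forum[c("posttitle", "names", "posts2")])
```

In this sketch the repeated sentence is removed from Elizabeth's post because it occurs earlier in the "garden" thread, while the identical sentence in the "kitchen" thread is kept. A post that consists only of quoted sentences ends up as an empty string instead of disappearing from the result.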
