I have a data frame that looks like this:
name <- c("p1", "p2", "p3", "p1", "p4","h1", "h2", "p2", "p3", "k1", "k2", "p1", "h2")
text <- c("This is the first sentence of p1. This is the second sentence of p1.", "This is the first sentence of p1. This is the first sentence of p2.", "This is the first sentence of p1. This is the first sentence of p2. This is the first sentence of p3.", "This is the first sentence of p2. This is the third sentence of p1.", "This is the first sentence of p4.", "This is the first sentence of h1. This is the second sentence of h1.", "This is the first sentence of h1. This is the first sentence of h2.", "This is the first sentence of p2.", "This is the first sentence of h1. This is the first sentence of p3.", "This is the first sentence of k1.", "This is the first sentence of k1. This is the first sentence of k2.", "This is the first sentence of k1. This is the first sentence of p1.", "This is the first sentence of p1. This is the first sentence of h1.")
group <- c("gr1", "gr1", "gr1", "gr1", "gr1","gr2", "gr2", "gr2", "gr2", "gr3", "gr3", "gr3", "gr3")
frame1 <- data.frame(name, text, group)
The variable text consists of sentences, many of which are duplicates. I want to remove the duplicates, but only within a group; across groups, duplicate sentences can stay. The group is given by the variable 'group'. What I have come up with is only a start, but it is the only approach I could think of, and I was hoping someone could help me take it further.
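library(tokenizers)  # tokenize_sentences()
library(tidyverse)   # enframe(), unnest(), %>%, group_by(), distinct()
list1 <- tokenize_sentences(as.character(frame1$text))
frame2 = enframe(list1) %>%
  unnest(cols = value)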
If I could now somehow get the variable 'group' from frame1 into frame2, so that frame2 has 24 rows whose group values follow the original row order of frame1, I could continue. To move forward, I computed the desired vector by hand and added it to frame2:
desiredvector <- c("gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr1","gr2","gr2","gr2","gr2","gr2","gr2","gr2","gr3","gr3","gr3","gr3","gr3","gr3","gr3")
desiredframe <- cbind(desiredvector, frame2)
cleanframe = desiredframe %>%
  group_by(desiredvector) %>%
  distinct(value, .keep_all = TRUE)
finalframe = aggregate(value ~ name, cleanframe, paste, sep = " ", collapse = NULL)
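One way the group column might be carried along without building the vector by hand is to tokenize inside the data frame itself, so that unnesting repeats 'group' for every sentence. The following is only a sketch on a small stand-in data frame; the names toy, row_id and result are made up for illustration, and it assumes dplyr >= 1.0 and tidyr >= 1.0.

```r
library(dplyr)
library(tidyr)
library(tokenizers)

# Small stand-in for frame1 (same column names, fewer rows; hypothetical data)
toy <- data.frame(
  name  = c("p1", "p2", "h1"),
  text  = c("This is shared. This is only in row one.",
            "This is shared. This is only in row two.",
            "This is shared."),
  group = c("gr1", "gr1", "gr2"),
  stringsAsFactors = FALSE
)

result <- toy %>%
  mutate(row_id = row_number(),                  # remember the original row
         value  = tokenize_sentences(text)) %>%  # list column: one character vector per row
  unnest(cols = value) %>%                       # one row per sentence; group is repeated
  distinct(group, value, .keep_all = TRUE) %>%   # keep the first occurrence within each group
  group_by(row_id, name, group) %>%
  summarise(text = paste(value, collapse = " "), .groups = "drop")
```

Note that in this sketch a row whose sentences all appeared earlier in its group would disappear from the result entirely, which may or may not be what is wanted.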
Regarding finalframe, I would also like to know how to get rid of the commas that toString adds when it is used to concatenate the final strings.
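For the comma issue, a possible direction is to pass paste as the function and give collapse = " " as an extra argument, since toString always joins with ", ". A minimal sketch on made-up data (toy, with_commas and with_spaces are hypothetical names, not the real frames):

```r
# Toy data: hypothetical row ids and sentences
toy <- data.frame(
  name  = c(1, 1, 2),
  value = c("First sentence.", "Second sentence.", "Third sentence."),
  stringsAsFactors = FALSE
)

# toString() joins the sentences of each name with ", "
with_commas <- aggregate(value ~ name, toy, toString)

# paste() with collapse = " " joins them with a single space instead;
# extra arguments after the function are passed on to paste()
with_spaces <- aggregate(value ~ name, toy, paste, collapse = " ")
```

Here paste is passed as a function, not called, and collapse (rather than sep) is what merges the elements of one vector into a single string.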
Thanks for any help, and also for any suggestions on how I could further improve the procedure or solve it in a different way.