I have a large number of Word files that I imported into R as text (one report per cell), with an ID for each subject. I then used the distinct function from dplyr to remove the duplicates.
However, some reports are the same except for minor differences (e.g. a few extra or missing words, an extra space, etc.), so dplyr does not count them as duplicates. Is there an efficient way to remove "highly similar" items in R?
This creates an example dataset (very simplified compared to the original data I am working on):
d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))
This is the dplyr code to remove exact duplicates. However, you will notice that items 3, 7 and 8 are almost the same:
library(dplyr)
d %>%
distinct(text, .keep_all = T) %>%
View()
It looks like there is a like function in dplyr, but I could not find how to apply it here (it also seems to work only for short strings, e.g. words): dplyr filter() with SQL-like %wildcard%
There is also a package, tidystringdist, that can calculate how similar two strings are, but I could not find a way to use it here to remove items that are similar but not identical.
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html
Any suggestions or guidance at this point?
Update:
It looks like the package stringdist may solve this, as suggested by the user below.
This question from rstudio website deals with a similar issue, although the desired output is a bit different. I applied their code to my data and was able to identify the similar ones. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2
library(tidystringdist)
library(tidyverse)
# First remove any duplicates:
# (no View() at the end here: View() returns NULL, which would overwrite d)
d = d %>%
  distinct(text, .keep_all = TRUE)
# this will identify the similar ones and place them in one vector called match:
match <- d %>%
tidy_comb_all(text) %>%
tidy_stringdist() %>%
filter(soundex == 0) %>% # Set a threshold
gather(x, match, starts_with("V")) %>%
.$match
# create negate function of %in%:
`%!in%` = Negate(`%in%`)
# this will remove from `d` every text that appears in `match`:
d2 = d %>%
filter(text %!in% match) %>%
arrange(text)
Using the code above, d2 does not contain ANY of the duplicates / similar items at all, but I would like to keep one copy of each.
Any thoughts on how to keep one copy (e.g. only the first occurrence)?
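To make the desired output concrete, here is a rough sketch of the behaviour I am after. It is not a tested solution for the real data. It uses base R's adist (Levenshtein distance) instead of tidystringdist, a shortened toy data frame (d_toy, hypothetical), and an arbitrary threshold of 0.15 on the distance relative to the longer string. All of these are assumptions and would need tuning.

```r
# Toy data (hypothetical, much shorter than the real reports)
d_toy <- data.frame(
  ID = 1:5,
  text = c("Plastics are usually poor conductors of heat.",
           "All plastics are polymers but not all polymers are plastic.",
           "Plastics are usually poor conductors of heat.",
           "All plastics are polymers however not all polymers are plastic.",
           "all plastics are polymers but not all  polymers are plastic."),
  stringsAsFactors = FALSE
)

# Normalise case and whitespace so trivial differences do not count
norm <- tolower(gsub("\\s+", " ", trimws(d_toy$text)))

# Pairwise Levenshtein distances, scaled by the length of the longer string
dist_mat <- adist(norm)
len <- nchar(norm)
rel <- dist_mat / outer(len, len, pmax)

threshold <- 0.15  # assumption: pairs below this count as "highly similar"

# Greedy pass: keep the first occurrence, drop later rows that are too close to it
keep <- rep(TRUE, nrow(d_toy))
for (i in seq_len(nrow(d_toy))) {
  if (!keep[i]) next
  later <- seq_len(nrow(d_toy)) > i
  keep[later & rel[i, ] < threshold] <- FALSE
}

d_unique <- d_toy[keep, ]
```

With this toy data only the first copy of each report remains in d_unique. The full pairwise distance matrix grows quadratically with the number of reports, so this may be too slow for a large collection.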