
I have a large number of Word files that I imported into R as text (one report per cell), with an ID for each subject.

Then I used the `distinct()` function from dplyr to remove the exact duplicates.

However, some reports are nearly identical, differing only in minor ways (e.g., a few extra or missing words, extra whitespace, etc.), so `distinct()` did not count them as duplicates. Is there an efficient way to remove such "highly similar" items in R?

This creates an example dataset (very simplified compared to the original data I am working on):

d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))

This is the dplyr code to remove exact duplicates. However, you will notice that items 3, 7, and 8 are almost the same:

library(dplyr)

d %>% 
  distinct(text, .keep_all = TRUE) %>% 
  View()

It looks like there is a `like` function for use with dplyr, but I could not figure out how to apply it here (it also seems to work only for short strings, e.g., single words): dplyr filter() with SQL-like %wildcard%

There is also a package, tidystringdist, that can calculate how similar two strings are, but I could not find a way to apply it here to remove items that are similar but not identical: https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

Any suggestions or guidance at this point?

Update:

It looks like the stringdist package may solve this, as suggested in the answer below.

This question from the RStudio Community site deals with a similar issue, although the desired output is a bit different. I applied its code to my data and was able to identify the similar items: https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2

library(tidystringdist)
library(tidyverse)

# First remove the exact duplicates
# (note: don't pipe into View() here, since assigning its result would set d to NULL):
d <- d %>% 
  distinct(text, .keep_all = TRUE)

# This identifies the similar strings and collects them in a character vector called `match`:
match <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match

# Create a negated version of %in%:
`%!in%` <- Negate(`%in%`)

# Remove every string that appears in `match` from `d`:
d2 = d %>% 
  filter(text %!in% match) %>% 
  arrange(text)


Using the code above, d2 does not contain ANY of the duplicated/similar items at all, but I would like to keep one copy of each.

Any thoughts on how to keep one copy (e.g., only the first occurrence)?
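Update 2:

A possible way to keep the first occurrence, as a sketch: it relies on the assumption that `tidy_comb_all()` builds its pairs in `combn()` order, so that `V1` always holds the earlier of the two strings. If that holds, dropping only the `V2` side of each matched pair keeps one copy.

library(tidystringdist)
library(tidyverse)

d <- d %>% 
  distinct(text, .keep_all = TRUE)

# V1 is the earlier string of each matched pair, V2 the later one,
# so removing only the V2 side keeps the first occurrence.
drop_these <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>%  # same threshold as above
  pull(V2)

d2 <- d %>% 
  filter(!text %in% drop_these) %>% 
  arrange(text)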

Bahi8482
  • Depending on the differences between the strings you may have different strategies. I recommend you give us some reproducible examples (a set of strings, the `dput`, and even a [reprex](https://github.com/tidyverse/reprex) example). Also, you could convert each string into a list and check the overlap between them. Whitespace can easily be dealt with by `stringr::str_replace_all(your_string, "\\s+", " ")`. – Aureliano Guedes Oct 31 '20 at 22:15
  • @AurelianoGuedes thanks for the reply. The reprex above should serve as a basic example. I can remove the extra spaces, new paragraphs, change to lower case, etc., which will partially help, as you mentioned. But the problem, as shown above, is that some reports have a few extra words that vary too much across strings to specify a regex that captures all of them. I could work on writing a function that does the job, but I wanted to ask here first, as there might be a generic function available that I am not aware of. – Bahi8482 Oct 31 '20 at 22:26
  • This is a related question, but it does not provide fine details (for average users at least): https://stackoverflow.com/questions/6683380/techniques-for-finding-near-duplicate-records – Bahi8482 Oct 31 '20 at 22:31
  • @Bahi8482 That question is 8 years old. Admittedly it is posed by a valued member of the R-respondents community, but it does not have answers that are up-to-date regarding currently available R packages. – IRTFM Nov 01 '20 at 02:15

2 Answers

library(stringdist)


dd <- d[ !duplicated( d[['text']] ), 'text' ]
dd
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."                                                                                                                                                                              
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."                                                                                                                                                                                                          
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."    
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains." 
[5] "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."

unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0

The distances scale with string length (longer strings can have larger maximum distances), so a single cutoff is not strictly comparable across strings; for this case, though, it looks like an upper bound of 20 would be adequate. A proper solution would use some ratio of the distance to the nchar of that vector element.

Not offered as a finished solution, but rather more as steps 1 and 2 out of 4.
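Not part of the steps above, but a sketch of what steps 3 and 4 might look like (the cutoff of 20 is the ad-hoc value from above, an assumption that fits this toy data rather than a general rule):

library(stringdist)

dd <- d[ !duplicated( d[['text']] ), 'text' ]
dm <- stringdistmatrix(dd, dd, method = "dl")

# Keep string i only if no earlier string j is within the cutoff,
# so the first occurrence of each near-duplicate group survives.
cutoff <- 20
keep <- sapply(seq_along(dd), function(i) !any(dm[i, seq_len(i - 1)] < cutoff))
dd[keep]

# Optional: length-normalized distances, per the note above.
dm_norm <- dm / outer(nchar(dd), nchar(dd), pmax)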

IRTFM
  • Thanks a lot; this is definitely a helpful step. I added code using `stringdist` to identify the similar items. Can you think of a way to remove them from the dataset while keeping the first occurrence? – Bahi8482 Nov 01 '20 at 03:16

I believe this package is what you are looking for: fuzzyjoin.

There are many fuzzy string-distance functions provided, but essentially two entries are "similar" if the fuzzy distance between them is small.
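A sketch of how that might look here (not tested on the real data; the `max_dist` of 20 is borrowed from the other answer, and the Damerau-Levenshtein method is an assumption): a string-distance self-join flags the near-duplicate pairs, and filtering on `ID.x < ID.y` lets you drop only the later copy of each pair.

library(dplyr)
library(fuzzyjoin)

# Self-join: every pair of rows whose texts are within max_dist of each other.
matches <- stringdist_inner_join(d, d, by = "text", method = "dl",
                                 max_dist = 20, distance_col = "dist") %>% 
  filter(ID.x < ID.y)  # keep each pair once; the .y side is the later copy

# Drop the later copies, keeping the first occurrence of each group.
d2 <- d %>% 
  filter(!ID %in% matches$ID.y)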

F.X.
  • Thanks a lot. I will study the package a bit and post some code here if I find a solution. – Bahi8482 Nov 01 '20 at 01:34
  • The fuzzyjoin package is just a "pipe-aware" application of the 'stringdist' package. – IRTFM Nov 01 '20 at 01:36
  • @IRTFM I was actually just going through this question on the RStudio Community site, https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2, which used the stringdist package to answer a similar question. If you are familiar with this package, would you be able to provide a code example to solve the issue I mentioned in the question? Thank you. – Bahi8482 Nov 01 '20 at 01:51
  • I had already used `stringdist` to create a distance matrix for the non-duplicated rows, but I wasn't seeing an elegant solution. I could see a clumsy one, so I'll post the code I have so far and maybe that will help. – IRTFM Nov 01 '20 at 01:57
  • @IRTFM Yes, I agree that fuzzyjoin uses stringdist as the backend for computing those distances, but I was thinking that fuzzyjoin might be easier to work with in the OP's situation. I personally cannot see a nice way to achieve what the OP would like, and I guess fuzzyjoin, with all its merge capabilities, may be a better choice than working with stringdist directly? – F.X. Nov 01 '20 at 16:50