
I am working with a data frame that contains, per row, a document number and its text only. The data was exported from an XML file and is stored in the variable text_df:

  line  text
  1     when uploading objective file bugzilla se
  2     spelling mistake docs section searching fo…
  3     editparams cgi won save updates iis instal…
  4     editparams cgi won save updates
  5     rfe unsubscribe from bug you reported
  6     unsubscribe from bug you reported

I am using the following code to identify and remove the duplicates:

library(text2vec)
library(magrittr)  # for %>%

doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)

# deliberately build a second set (here identical) for the pairwise comparison
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)

it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)

# pairwise cosine similarity between the two document sets
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")

mat <- d1_d2_cos_sim
# zero the diagonal and lower triangle so each pair is counted once
mat[lower.tri(mat, diag = TRUE)] <- 0

# convert the sparse matrix into a data frame
mdf <- as.data.frame(as.matrix(mat))

datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  if (length(t) > 0) {
    datalist[[i]] <- t  # record the near-duplicates of document i
  }
}

# number of duplicates found
length(unique(unlist(datalist)))

# drop the duplicate columns, then keep only the corresponding rows
tmdf <- subset(mdf, select = -c(unique(unlist(datalist))))
text_df <- text_df[names(tmdf), ]
nrow(text_df)

This code takes a lot of time to run. Any suggestions to make it faster are welcome.

osmjit
  • Heeding https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example would go a long way in helping others help you. – hrbrmstr Nov 29 '18 at 14:24

1 Answer


The quanteda library works quite well for this case. Below is an example:

library(tibble)
library(quanteda)
df<- data_frame(text = c("when uploading objective file bugzilla se",
       "spelling mistake docs section searching fo",
       "editparams cgi won save updates iis instal",
       "editparams cgi won save updates",
       "rfe unsubscribe from bug you reported",
       "unsubscribe from bug you reported"))
DocTerm <- quanteda::dfm(df$text)
textstat_simil(DocTerm, margin="documents", method = "cosine")
          text1     text2     text3     text4     text5
text2 0.0000000                                        
text3 0.0000000 0.0000000                              
text4 0.0000000 0.0000000 0.8451543                    
text5 0.0000000 0.0000000 0.0000000 0.0000000          
text6 0.0000000 0.0000000 0.0000000 0.0000000 0.9128709

If one wants to subset this and see which pairs are more similar than a specific threshold (here 0.9), one can do the following:

mycosinesim <- textstat_simil(DocTerm, margin = "documents", method = "cosine")
myMatcosine <- as.data.frame(as.matrix(mycosinesim))
higherthan90 <- as.data.frame(which(myMatcosine > 0.9, arr.ind = TRUE, useNames = TRUE))
higherthan90[which(higherthan90$row != higherthan90$col), ]

        row col
text6     6   5
text5.1   5   6

Now you can decide whether to remove document 5 or 6, since they are nearly identical.
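
A minimal sketch of that removal step, building on the objects above (keeping the earlier document of each similar pair is an assumption; adjust to your needs):

pairs <- higherthan90[higherthan90$row != higherthan90$col, ]
# each pair appears twice ((5,6) and (6,5)); drop the later member of each
drop <- unique(pairs$row[pairs$row > pairs$col])
df_dedup <- if (length(drop) > 0) df[-drop, ] else df
nrow(df_dedup)  # 5: text6 was dropped as a near-duplicate of text5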

Carles
  • Thanks @Carles for the reply, but I also want to remove the documents that have a similarity of e.g. more than 0.9 from the data frame. Please suggest that too. – osmjit Nov 29 '18 at 16:10
  • I hope the edit shows you how to extract it. Cheers! :) – Carles Nov 29 '18 at 16:39
  • I appreciate the answer, but the second part would again be compute-intensive for the 90,000 documents I am currently working with. Is there another alternative that can work here? – osmjit Nov 30 '18 at 00:28
  • I am sorry @osmjit, I am not sure how to make it super efficient. Nevertheless, that would be another, rather more interesting question: how to efficiently extract indexes from big data frames. Please close the question since it has been answered. Cheers! – Carles Nov 30 '18 at 10:39
  • I have found this, which could help you do it faster :) https://stackoverflow.com/questions/28233561/finding-rows-containing-a-value-or-values-in-any-column; I hope it helps you out! – Carles Nov 30 '18 at 11:01
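
Following up on the comments above, a sketch of one way to avoid the dense conversion for a large corpus, staying in sparse form with the text2vec objects from the question (dtm1, text_df); this is untested at the scale of 90,000 documents:

library(Matrix)
library(text2vec)

# cosine similarity of the corpus against itself; the result stays sparse
sim <- sim2(dtm1, dtm1, method = "cosine", norm = "l2")

# work on the sparse triplet representation (columns i, j, x)
# instead of materializing a dense n-by-n data frame
trip <- summary(sim)
dup  <- trip[trip$x > 0.8 & trip$j > trip$i, ]  # upper triangle only
drop <- unique(dup$j)  # the later member of each similar pair

text_df_dedup <- if (length(drop) > 0) text_df[-drop, ] else text_df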