I am working with a data frame that contains, per row, a document number and its text. The data was exported from an XML file and is stored in the variable text_df:

line / text
1 when uploading objective file bugzilla se
2 spelling mistake docs section searching fo…
3 editparams cgi won save updates iis instal…
4 editparams cgi won save updates
5 rfe unsubscribe from bug you reported
6 unsubscribe from bug you reported
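For reproducibility, the excerpt above could be rebuilt as a small data frame like the one below. The strings are hypothetical placeholders taken from the truncated excerpt, not the real data:

```r
# Toy reconstruction of text_df (truncated strings from the excerpt, stood in as-is)
text_df <- data.frame(
  line = 1:6,
  text = c(
    "when uploading objective file bugzilla se",
    "spelling mistake docs section searching fo",
    "editparams cgi won save updates iis instal",
    "editparams cgi won save updates",
    "rfe unsubscribe from bug you reported",
    "unsubscribe from bug you reported"
  ),
  stringsAsFactors = FALSE
)

nrow(text_df)  # 6 documents, one per row
```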
I am using the following code to identify and remove the duplicates:
library(text2vec)

# Tokenize once -- the corpus is compared against itself,
# so a single iterator and a single DTM are sufficient
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

# Pairwise cosine similarity between all documents
mat = sim2(dtm, method = "cosine", norm = "l2")

# Zero the lower triangle and diagonal so each pair is considered once
mat[lower.tri(mat, diag = TRUE)] <- 0

# For each row, collect the columns whose similarity exceeds 0.8
datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  if (length(t) > 0) {  # was length(t) > 1, which skipped rows with exactly one duplicate
    datalist[[i]] <- t
  }
}

# Number of duplicates found
dup_idx <- unique(unlist(datalist))
length(dup_idx)

# Remove the duplicate documents by row index
# (indexing by the column names of a dense copy, as before, did not match
# the row names of text_df; dropping rows by index avoids the dense copy entirely)
if (length(dup_idx) > 0) {
  text_df <- text_df[-dup_idx, ]
}
nrow(text_df)
This code takes a lot of time to run; any suggestions to make it faster are welcome.
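For what it's worth, I suspect the per-row loop over the similarity matrix is part of the bottleneck. A sketch of what I have in mind instead (on a toy similarity matrix, not my real data): pull the high-similarity pairs straight from the sparse matrix's triplet representation, with no loop and no dense copy. The values below are made up for illustration.

```r
library(Matrix)

# Toy stand-in for the cosine-similarity matrix over 4 documents
# (hypothetical values; in the real code this would come from sim2)
sim <- sparseMatrix(
  i = c(1, 1, 2, 3),
  j = c(2, 3, 3, 4),
  x = c(0.95, 0.10, 0.20, 0.85),
  dims = c(4, 4)
)

# Non-zero entries as (row, col, value) triplets -- no dense conversion
trip <- summary(sim)

# Upper-triangle pairs above the threshold: column j duplicates row i
dup_cols <- unique(trip$j[trip$x > 0.8 & trip$i < trip$j])

dup_cols  # c(2, 4): documents 2 and 4 duplicate earlier ones
```

The same `dup_cols` could then drop rows with `text_df[-dup_cols, ]`, guarding against the empty case as above.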