
When downloading LexisNexis newspaper articles, there are often a lot of duplicate articles in the corpus. I want to remove them, and I was thinking of doing so using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?

3 Answers


Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.

Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.

We can use textstat_simil() to generate a similarity matrix, form the pairwise comparisons directly from the returned dist object, and then keep just one document from each set of similar documents.

library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14

mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")

mydfm <- dfm(mydocs)

(sim <- textstat_simil(mydfm))
#             a1          b1          a2          b2
# b1 -0.22203788                                    
# a2  0.80492203 -0.23090513                        
# b2 -0.23427416  0.90082239 -0.28140219            
# a3  0.81167608 -0.09065452  0.92242890 -0.12530944

# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
                        sim = as.numeric(sim), 
                        stringsAsFactors = FALSE)
sim_pairs
#    X1 X2         sim
# 1  a1 b1 -0.22203788
# 2  a1 a2  0.80492203
# 3  a1 b2 -0.23427416
# 4  a1 a3  0.81167608
# 5  b1 a2 -0.23090513
# 6  b1 b2  0.90082239
# 7  b1 a3 -0.09065452
# 8  a2 b2 -0.28140219
# 9  a2 a3  0.92242890
# 10 b2 a3 -0.12530944

Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().

# set the threshold for similarity
threshold <- 0.80

# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"

# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
#     features
# docs a b c d w g j t l y h x s k i r f p q e
#   b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
#   a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0

Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
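
For example, a minimal sketch of the clustering route might look like the following (the 0.2 distance cutoff, i.e. a cosine similarity of 0.8, is an arbitrary illustration):

# sketch: convert cosine similarities to distances, cluster the documents
# hierarchically, and keep one document per cluster
distmat <- 1 - as.matrix(textstat_simil(mydfm, method = "cosine"))
hc <- hclust(as.dist(distmat), method = "average")

# documents whose average distance is below 0.2 end up in the same cluster
clusters <- cutree(hc, h = 0.2)

# keep the first document encountered in each cluster
keepers <- names(clusters)[!duplicated(clusters)]
dfm_subset(mydfm, docnames(mydfm) %in% keepers)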

Ken Benoit
    Wow, this is really awesome. I'm sorry for not coming up with a reproducible example, but I didn't know how to create a data frame with pairs. Again, thanks for this. – fritsvegters Nov 15 '18 at 14:40

If you have thousands of documents, storing all the similarity scores takes a lot of RAM, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().

In this example, cosine similarity scores smaller than 0.9 are all ignored.

library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)

(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
#    a1        b1        a2        b2        a3    
# a1  1 .         .         .         .        
# b1  . 1.0000000 .         0.9113423 .        
# a2  . .         1.0000000 .         0.9415838
# b2  . 0.9113423 .         1.0000000 .        
# a3  . .         0.9415838 .         1.0000000

matrix2list <- function(x) {
    names(x@x) <- rownames(x)[x@i + 1]
    split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}

matrix2list(sim)
# $a1
# a1 
#  1 
# 
# $b1
# b1 
#  1 
# 
# $a2
# a2 
#  1 
# 
# $b2
#        b1        b2 
# 0.9113423 1.0000000 
# 
# $a3
#        a2        a3 
# 0.9415838 1.0000000 

See https://koheiw.net/?p=839 for the performance differences.
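
To go from this sparse matrix to a de-duplicated dfm, one option is a small sketch along these lines: zero out the diagonal and the lower triangle, then flag every document that still matches an earlier one above the threshold.

# sketch: flag each document that matches an earlier document above the
# threshold, then drop it from the dfm
simmat <- as.matrix(sim)
simmat[lower.tri(simmat, diag = TRUE)] <- 0
todrop <- colnames(simmat)[colSums(simmat > 0) > 0]
todrop
# [1] "b2" "a3"

dfm_subset(mydfm, !docnames(mydfm) %in% todrop)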

Kohei Watanabe

You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.

The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).

I also suggest you think carefully about thresholds, as you might otherwise remove some articles wrongfully. I included a function called lnt_diff() that visualizes the difference between two articles, so you can get a better grip on the data you are removing.
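
A rough outline of that workflow, based on the package tutorial (argument names and defaults may differ between versions, so treat this as a sketch rather than copy-and-paste code):

library("LexisNexisTools")

# read the raw LexisNexis files into an LNToutput object
LNToutput <- lnt_read("path/to/lexisnexis/files")

# compare articles published on the same date and flag likely duplicates
duplicates <- lnt_similarity(LNToutput = LNToutput, threshold = 0.97)

# inspect borderline pairs before deleting anything (the min/max values
# here are arbitrary)
lnt_diff(duplicates, min = 0.25, max = 1)

# remove the flagged duplicates from the LNToutput object
LNToutput <- LNToutput[!LNToutput@meta$ID %in% duplicates$ID_duplicate]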

JBGruber