5

I'm writing a crawler to get content from some website, but the content can duplicated, I want to avoid that. So I need a function can return the same percent between two text to detect two content maybe duplicated Example:

  • Text 1:"I'm writing a crawler to"
  • Text 2:"I'm writing a some text crawler to get"

The compare function will return text 2 as the same text 1 by 5/8%(with 5 is words number of text 2 same text 1(compare by word order), and 8 is total words of text 2). If remove the "some text" then text 2 as the same text 1(I need detect the situation).How can I do that?

fgregg
  • 3,173
  • 30
  • 37
user3508779
  • 59
  • 1
  • 4
  • 2
    Close voters: This is a legitimate Information-Retrieval question. Not all questions on SO should be "how to parse a string in c#?" There is nothing wrong with a question that asks how to do something - that can later be implemented in any programming language. – amit Apr 14 '14 at 06:59
  • Look into https://en.wikipedia.org/wiki/SimHash – amirouche Aug 14 '20 at 13:09

3 Answers3

9

You are facing a problem which is known in the field of Information Retrieval as Near Duplicates Detection.

One of the known solutions to it is to use Jaccard-Similarity for getting the difference between two documents.

Jaccard Similarity is basically - get sets of words from each document, let these sets be s1 and s2 - and the jaccard similarity is |s1 [intersection] s2|/|s1 [union] s2|.

Usually when facing near duplicates - the order of words has some importance however. In order to deal with it - when generating the sets s1 and s2 - you actually generate sets of k-shinglings, instead sets of only words.
In your example, with k=2, the sets will be:

s1 = { I'm write, write a, a crawler, crawler to }
s2 = { I'm write, write a, a some, some text, text crawler, crawler to, to get }
s1 [union] s2 = { I'm write, write a, a crawler, crawler to, a some, some text, text crawler, to get } 
s1 [intersection] s2 = { I'm write, write a, crawler to }

In the above, the jaccard-similarity will be 3/8. If you use single words with the same approach, (k=1 shinglings) you will get your desired 5/8 - but this is worse solution in my (and most IR experts) opinion.

This procedure can be scaled nicely to deal very efficiently with huge collections, without checking all pairs and creating huge numbers of sets. More details could be found in these lecture notes (I gave this lecture few months ago, based on the author's notes).

amit
  • 175,853
  • 27
  • 231
  • 333
  • Any thoughts about the hashing algorithms compared to this? – David Kong Aug 21 '20 at 12:53
  • @amit: The link is dead. Are your lecture notes still available somewhere else? Or maybe there is a more recent lecture by now you can point us to? Thanks! – binford Apr 10 '22 at 18:28
2

A good algorithm for comparing two text is tf-idf. It will give similarity between two documents.

1. calculate tf-idf for the document
2. calculate cosine similarity for two given text
3. the cosine similarity will indicate match between two documents.

This is a very good tutorial for calculating tf-idf and cosine similarity in Java. It would be simple to extend it to C#.

coder hacker
  • 4,819
  • 1
  • 25
  • 50
  • 2
    The problem with tf-idf and cosine-similarity is it does not scale nicely. You will end up having to compare between every two pair of documents (which is usually in the scale of 10^12-10^20). This is why the 'less informative' jaccard-similarity is usually prefered, it turns out to be much easier to compute, and does not require going over all pairs of documents. – amit Apr 14 '14 at 07:23
  • @amit I was just listing the possibility. Its good practice to be informed about all possible ways to achieve. – coder hacker Apr 14 '14 at 07:25
  • 1
    Of course it is, I did not downvote or anything, just mentioned it's disadvantage as well, it is also important to know the problems with the different possible options. – amit Apr 14 '14 at 07:26
0

In bioinformatics there is a algorithm which should do the job. It is called Needleman-Wunsch and is normally used for global sequence alignment with nucletide sequences.

Using this algorithm you could easily calculate accordance between two strings. You can use my code. But this method only returns the alignment you would have to calculate the accordance yourself.

チーズパン
  • 2,752
  • 8
  • 42
  • 63