0

I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding Text to database, check for existing values in database
  2. If values are seen to be very similar then do not add
  3. If values are seen to be different enough, then do add

Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.

Can anyone tell me how I might attempt this in Python?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Jimmy
  • 12,087
  • 28
  • 102
  • 192

4 Answers4

2

Looking at your problem, difflib.SequenceMatcher.ratio() might come handy.

This nifty routine, takes two strings and calculates a similarity index in the range [0,1]

Quick Demo

>>> for a,b in list(itertools.product(st, st)):
    print "Text 1 {}".format(a)
    print "Text 2 {}".format(b)
    print "Similarity Index {}".format(difflib.SequenceMatcher(None, a,b).ratio())
    print '-'*80


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Abhijit
  • 62,056
  • 18
  • 131
  • 204
1

There are a couple of python libraries that can help you with that. Have a look at this Q:.

The levisthein distance is a common algorithm. I found the nysiis algorithm very useful. Especially if you want to save a string representation in a DB.

This link will give you an excellent overview:

Community
  • 1
  • 1
LarsVegas
  • 6,522
  • 10
  • 43
  • 67
1

A primitive way of doing this... but you could iterate through strings, comparing the equivalent sequential word in another string and you get a ratio of matches to fails:

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example, you can see 11/12 words matched. You can then set a pass / fail level

Noel Evans
  • 8,113
  • 8
  • 48
  • 58
0

In python or any other language hashes are the easiest way to remove duplicates.

You can maintain a table of already added hashes. when you add another just check if hash is present or not.

Use hashlib for it

Adding hashlib usage example

import hashlib
m1 = hashlib.md5()
m1.update(" the spammish repetition")
print m1.hexdigest()

m2 = hashlib.md5()
m2.update(" the spammish")
print m2.hexdigest()

m3 = hashlib.md5()
m3.update(" the spammish repetition")
print m3.hexdigest()

Ans

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
duck
  • 2,483
  • 1
  • 24
  • 34
  • Can you please explain how to get an idea of the differences in the hashes. It just looks like a long number and I don't know how you could numerically compare the two – Jimmy Aug 22 '13 at 11:47
  • It's actaully impossible. Hashes doesn't give you any information on the object itself. Two completely different string can have the same hash value while two string with one char difference can have totally different hash. Ex using md5: aaaa => 74b87337454200d4d33f80c4663dc5e5 aaab => 4c189b020ceb022e0ecc42482802e2b8. Using hashing, you can only tell if two object are different or not (not equal, since as I already said, two different objects can have the same hash but two object with different hashes cannot be equal) – Samy Arous Aug 22 '13 at 12:30
  • @duck Following what I just said, if you want to see if an object already exists, using hashes will only give you a potential match. You still need to do a real check before you can decide if both objects are the same. (yes I know, it's unlikely but it happens and it can cause huge bugs) – Samy Arous Aug 22 '13 at 12:33