I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:
Text A: One day a man walked over the hill and saw the sun
Text B: One day a man walked over a hill and saw the sun
Text C: One week a woman looked over a hill and saw the sun
In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:
- When adding Text to database, check for existing values in database
- If values are seen to be very similar then do not add
- If values are seen to be different enough, then do add
Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.
Can anyone tell me how I might attempt this in Python?