
I am using Tweepy. I found a way to filter out identical strings, but I am having a hard time filtering out similar ones.

I have two sentence strings that I need to compare (Tweepy keyword = "Donald Trump"):

String 1: "Trump Administration Dismisses Surgeon General Vivek Murthy (http)PUGheO7BuT5LUEtHDcgm"

String 2: "Trump Administration Dismisses Surgeon General Vivek Murthy (http)avGqdhRVOO"

As you can see, they are similar but not identical. I need a way to compare the two and get a numeric value, so I can decide whether the second tweet should be grouped with the first. I thought I had the solution when I used SequenceMatcher(), but it always printed 0.0. I was expecting a value greater than 0.5. SequenceMatcher only seems to work for one-word strings (correct me if I am wrong).

Now you are probably thinking, "just slice off the http portions". That won't work either, because it does not account for account names at the start of tweets, like @cars: xyz zyx and @trucks: xyz zyx.

Is there some way to compare the two texts? It should be simple, but for some reason the solution eludes me. I just learned Python a week ago. It still feels weird using indents to discern what's inside a function and what isn't.

lambda
LuxLunae
  • There are a ton of tools in the [jellyfish](https://github.com/jamesturk/jellyfish) package. (I am not affiliated with that project.) – Arya McCarthy Apr 22 '17 at 16:50

2 Answers

You can use SequenceMatcher(...).ratio() from difflib, which returns a similarity score between 0 and 1, for example:

from difflib import SequenceMatcher

a = "I love Coding"
b = "I love Codiing"

ratio = SequenceMatcher(None, a, b).ratio()
# 0.9629629629629629
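Applied to the two tweets from the question, the same call could look like the sketch below. It also reproduces the 0.0 result: when `None` is left out, the first string is taken as the `isjunk` argument and the second sequence defaults to an empty string. The helper `is_near_duplicate` and its `threshold` of 0.8 are illustrative assumptions, not part of difflib.

```python
from difflib import SequenceMatcher

tweet_1 = "Trump Administration Dismisses Surgeon General Vivek Murthy (http)PUGheO7BuT5LUEtHDcgm"
tweet_2 = "Trump Administration Dismisses Surgeon General Vivek Murthy (http)avGqdhRVOO"

# Correct call: the first positional argument is `isjunk`, so pass None there.
good_ratio = SequenceMatcher(None, tweet_1, tweet_2).ratio()

# Pitfall: without None, tweet_1 becomes `isjunk`, tweet_2 becomes `a`,
# and `b` defaults to "", so nothing can match.
bad_ratio = SequenceMatcher(tweet_1, tweet_2).ratio()


def is_near_duplicate(text, seen, threshold=0.8):
    """Hypothetical helper: True if `text` is close to any string in `seen`.

    The threshold is an assumed value; tune it on real tweets.
    """
    return any(
        SequenceMatcher(None, text, other).ratio() >= threshold
        for other in seen
    )


print(good_ratio, bad_ratio)
print(is_near_duplicate(tweet_2, [tweet_1]))
```

Comparing every new tweet against every stored one gets slow as the list grows, so this is only reasonable for small batches.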

Pedro Lobito
    I forgot to put the "None" portion in SequenceMatcher () function!!! Thanks for helping me see that quickly lol. I was sitting here for 2-3 hrs trying to figure out what I was doing wrong. – LuxLunae Apr 22 '17 at 17:03

What you are looking for here is the edit distance between two strings. The edit distance is the minimum number of single-character substitutions, deletions, and insertions required to turn one string into the other. It is usually implemented using dynamic programming. It's actually a pretty cool interview question or project for testing your programming skills.

Here are a few implementations in Python along with some description.

User aryamccarthy has already mentioned the jellyfish library, which implements this functionality (Levenshtein distance) and has many more interesting tools for matching strings. Definitely worth a look.
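For reference, the dynamic-programming idea can be sketched in plain Python without a third-party package. This is a minimal illustration, not the jellyfish implementation; the function names `levenshtein` and `similarity` are made up for this example, and normalising by the longer string's length is just one possible way to get a score between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("I love Coding", "I love Codiing"))  # 1 (one inserted "i")
```

The running time is proportional to len(a) * len(b), which is fine for tweet-sized strings.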

Community
PeskyPotato