5

Let's say that my program receives an input such as a string of characters that has any type of character. For example, 'Bob's Bagel Shop'. Then it gets another string that says 'Fred's Bagel Store'. How can I use regular expressions or some other module in python to compare these and have my program tell me if at least 5 (or any number I want) of the characters are the same anywhere in the string, but all in the same order, such as the word 'Bagel'?

Thanks.

Jon Clements
  • 138,671
  • 33
  • 247
  • 280
Marcus Johnson
  • 2,505
  • 6
  • 22
  • 27
  • Would it be useful to simply compare how many words are the same? It would be much much more efficient than testing for any five characters! – Billy Moon Aug 12 '12 at 18:19
  • @BillyMoon The problem is...these are usually business names (like McDonald's or something haha)...so it could have characters in it..but yes, it would probably be simpler. – Marcus Johnson Aug 12 '12 at 18:20
  • 1
    you could use regex to determine what you consider a word to be (including special characters etc...), and then simply check each word in the first string against each word in the second. – Billy Moon Aug 12 '12 at 18:22
  • `sBagelS` should be the longest same string for this example? – Ashwini Chaudhary Aug 12 '12 at 18:30
  • Uh...yeah... 's Bagel S' would be it I guess. – Marcus Johnson Aug 12 '12 at 18:33

3 Answers3

12

There's a Python standard library class difflib.SequenceMatcher that will help to solve your problem. Here's a code sample:

from difflib import SequenceMatcher

s1 = "Bob's Bagel Shop"
s2 = "Bill's Bagel Shop"

matcher = SequenceMatcher(a=s1, b=s2)
match = matcher.find_longest_match(0, len(s1), 0, len(s2))

Result:

Match(a=3, b=4, size=13)  # value that 'match' variable holds

The result shows that both string has equal substring with 13 characters length (starting from 3-rd char in first string and 4-th char in second string).

You can use this match result object to get its fields as values:

match.size  # 13
match.a     # 3
match.b     # 4
Rostyslav Dzinko
  • 39,424
  • 5
  • 49
  • 62
1

you can use itetools.combinations and then use intersection of sets to find out matching characters from both strings:

from itertools import combinations
str1="Bob's Bagel Shop"
str2="Fred's Bagel Store"

def combi(strs):
    chars=''.join(strs.split())
    lis=[]
    for x in range(1,len(chars)):
        for y in combinations(chars,x):
            if ''.join(y) in chars:
                lis.append(''.join(y))
    return lis           


lis1=combi(str1)
lis2=combi(str2)
print max(set(lis1).intersection(set(lis2)),key=len)  

output:

'sBagelS
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
0

See

String similarity metrics in Python

or checkout the simhash module:

http://bibliographie-trac.ub.rub.de/browser/simhash.py

Community
  • 1
  • 1