0

I am trying to implement searching a set of multiple words which include multiple parts. For example, we have these medical terminologies.

R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE

User can enter search terms like this:

e.g., "R De Me Li" then this should find "R Deep Transverse Metatarsal Ligament 4 GEODE"

e.g., "Fi Colla" ==> "R Fibular Collateral Bursa GEODE", "R Fibular Collateral Ligament GEODE"

e.g., "bow ODE" ==> "R Elbow Capsule GEODE"

That is, even when user enters some portions of a word, it should find the answers. If there are multiple answers, it should show all. I appreciate your help in advance.

Added) Oh.. I forgot something.

e.g., "ral lar" ==> It shouldn't show "R Fibular Collateral Bursa GEODE" or "R Fibular Collateral Ligament GEODE" since the order of query words should be considered.

In addition, the spaces among the query words mean different words of each line (database).

The order of query words should be the same as the words of each line (database), but the query words could be shorter than the database words.

e.g., "R De Me 4" ==> "R Deep Transverse Metatarsal Ligament 4 GEODE" Where we can see that 'Metatarsal' and 'Ligament' include 'me', but the first match with 'Metatarsal' is fine, and 4 will be searched.

In addition, different combinations of query words can return the same result.

e,g.,

'Car' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

'Do Car' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

'R Do Carp' ==> 'R Dorsal Carpometacarpal Ligament 2 GEODE'

Note: no case-sensitive.

Sam Kim
  • 35
  • 6
  • I looked up 'Python v2.7.3' references, and webpages, and tried lots of things, but I couldn't make it. Basically, I am not good at regular expressions yet. – Sam Kim Jun 07 '12 at 16:42
  • 1
    If there is one answer that responded to your problem, maybe you can consider to "accept" it ? – cedbeu Jun 09 '12 at 14:08

4 Answers4

4

You can do this with difflib in the standard distribution:

import difflib

s="""R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE""".split('\n')

qs="""R De Me Li
Fi Colla
bow ODE""".split('\n')

for q in qs:
    print "results for '{}':".format(q)
    matches=difflib.get_close_matches(q,s,3,0.3)
    for i,e in enumerate(matches,1):
        print "\t{}. {}".format(i,e)

Prints:

results for 'R De Me Li':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
    2. R Dorsal Calcaneocuboid Ligament GEODE
    3. R Dorsal Cuboideavicular Ligament GEODE
results for 'Fi Colla':
    1. R Fibular Collateral Bursa GEODE
    2. R Fibular Collateral Ligament GEODE
results for 'bow ODE':
    1. R Elbow Capsule GEODE

Combining with cblab's answer on combining regex's with difflib, you can get this:

s="""R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE""".split('\n')
s=set(s)
qs="""R De Me Li
Fi Colla
bow ODE
Car
Do Car
ral lar
R De Me 4
R Do Carp""".split('\n')

for q in sorted(qs):
    print "results for '{}':".format(q)
    pattern = r'.*' + re.sub(r'\W', '.*', q.strip()) + '.*'
    matches=[item for item in s if re.match(pattern, item, re.I)]
    for e in difflib.get_close_matches(q,s,3,0.33):
        if e not in matches: 
            matches.append(e)

    for i,e in enumerate(matches,1):
        print "\t{}. {}".format(i,e)
    else:
        if len(matches)==0:
            print "\tNo matches"    

Prints:

results for 'Car':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
results for 'Do Car':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
results for 'Fi Colla':
    1. R Fibular Collateral Bursa GEODE
    2. R Fibular Collateral Ligament GEODE
results for 'R De Me 4':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
results for 'R De Me Li':
    1. R Deep Transverse Metatarsal Ligament 4 GEODE
    2. R Dorsal Calcaneocuboid Ligament GEODE
results for 'R Do Carp':
    1. R Dorsal Carpometacarpal Ligament 2 GEODE
    2. R Elbow Capsule GEODE
    3. R Distal JointCapsule 1 GEODE
results for 'bow ODE':
    1. R Elbow Capsule GEODE
results for 'ral lar':
    No matches
Community
  • 1
  • 1
dawg
  • 98,345
  • 23
  • 131
  • 206
3

A simple pythonic solution that does the job as answered and that is not case sensitive:

import re

def search(request, base):
    pattern = r'.*' + re.sub(r'\W', '.*', request.strip()) + '.*'
    return [item for item in base if re.match(pattern, item, re.I)]

Basically, we create a simple regular expression that matches any string that contains all the substrings of the request (all that is separated by non-word character) in the original order, with anything before, in the between and after.

For instance, a request 'R De Me Li' becomes a pattern r'.*R.*De.*Me.Li.*'

Then, we return a list of all matching results. It's not case sensitive thanks to the flag re.I in the re.match().

Then, it works as expected, you can try with the base:

>>> base = ['R Deep Transverse Metatarsal Ligament 4 GEODE',
'R Distal JointCapsule 1 GEODE',
'R Dorsal Calcaneocuboid Ligament GEODE',
'R Dorsal Carpometacarpal Ligament 2 GEODE',
'R Dorsal Cuboideavicular Ligament GEODE',
'R Dorsal Tarsometatarsal Ligament 5 GEODE',
'R Elbow Capsule GEODE',
'R F Distal JointCapsule 1 GEODE',
'R Fibular Collateral Bursa GEODE',
'R Fibular Collateral Ligament GEODE',
'R Fibular Ligament GEODE']

Some example requests:

>>> search('R De Me Li', base)
['R Deep Transverse Metatarsal Ligament 4 GEODE']
>>> search('Fi Colla', base)
['R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE']
>>> search('bow ODE', base)
['R Elbow Capsule GEODE']
>>> search('Car', base)
['R Dorsal Carpometacarpal Ligament 2 GEODE']
>>> search('F', base)
['R F Distal JointCapsule 1 GEODE', 'R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE', 'R Fibular Ligament GEODE']
>>> search('F Ca', base)
['R F Distal JointCapsule 1 GEODE']
>>> search('F Co', base)
['R Fibular Collateral Bursa GEODE', 'R Fibular Collateral Ligament GEODE']

note: it will match only if the order is the same in the request and in the item (i.e. 'ode bow' as a request won't match ['R Elbow Capsule GEODE'], while 'bow ode' will).

note: I don't think fuzzy search would help a lot here, at least at first, since it's based on distance such as Levenshtein's (edit distance), which would be very big between, say, 'Fi' and 'Fibular' (distance of 5 in a word of 7 ... at 35% I'm not sur it's a good idea to match ... You could use it if you were pretty sure the request contains only full words with possible little mistyping)

cedbeu
  • 1,919
  • 14
  • 24
  • Per "I don't think fuzzy search would help a lot here": so long as edit distance between "fi" and "fibular" is less than between "fi" and "Dorsal Tarsometatarsal", it works - it doesn't have to be a "good" match, just "better than the others". – Hugh Bothwell Jun 07 '12 at 03:29
  • Look, using the distance method, when you search for 'Car' -> 'R Fibular Collateral Bursa GEODE', now 'Do Car' -> 'R Elbow Capsule GEODE', 'R Do Carp' -> 'R Elbow Capsule GEODE', and so on and so forth ... And yet, we are on a very small dataset. I am not sure that this is the wanted behavior. When we search for 'Car' it should return 'R Dorsal Carpometacarpal Ligament 2 GEODE', for 'Do Car' or 'R Do Carp' as well. That is the way I understood the problem, though. – cedbeu Jun 07 '12 at 04:29
  • I think fuzzy search could help to improve a more advanced search algorithm with a solid full text search basis, though. But it's not a so straightforward solution. But anyway, difflib seems to be a good alternative to not reinvent the wheel. – cedbeu Jun 07 '12 at 05:01
  • 1
    cblab, your understanding about the search behavior is correct. I added some explanation to my original question. Thanks all of you for the help. :) – Sam Kim Jun 07 '12 at 17:58
  • Ok, you're welcome. This should work exactly as expected, then. If you want some more results (approximations, for example for some suggestions if there is no exact results, then you could use the smart solution with difflib as shown by @drewk above – cedbeu Jun 07 '12 at 18:15
1

Not really a "regular expression" problem; you should be looking at fuzzy comparison of strings, ie Levenshtein distance or diff.

See https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

Edit: Some sample code:

import Levenshtein

base_strings = [
    "R Deep Transverse Metatarsal Ligament 4 GEODE",
    "R Distal JointCapsule 1 GEODE",
    "R Dorsal Calcaneocuboid Ligament GEODE",
    "R Dorsal Carpometacarpal Ligament 2 GEODE",
    "R Dorsal Cuboideavicular Ligament GEODE",
    "R Dorsal Tarsometatarsal Ligament 5 GEODE",
    "R Elbow Capsule GEODE",
    "R F Distal JointCapsule 1 GEODE",
    "R Fibular Collateral Bursa GEODE",
    "R Fibular Collateral Ligament GEODE",
    "R Fibular Ligament GEODE"
]

def main():
    print("Medical term matcher:")
    while True:
        t = raw_input('Match what? ').strip()
        if len(t):
            print("Best match: {}".format(sorted(base_strings, key = lambda x: Levenshtein.ratio(x, t), reverse=True)[0]))
        else:
            break

if __name__=="__main__":
    main()

Actual output:

Medical term matcher:
Match what? R De Me Li
Best match: R Deep Transverse Metatarsal Ligament 4 GEODE
Match what? Fi Colla
Best match: R Fibular Collateral Bursa GEODE
Match what? bow ODE
Best match: R Elbow Capsule GEODE
Match what? 

Edit 2: "If there are multiple answers, it should show all" - the basis strings are all answers to varying degrees. The question, then, is what sort of similarity-value cutoff you want to use; maybe something like "all answers at least 90% as good as the best match"?

Community
  • 1
  • 1
Hugh Bothwell
  • 55,315
  • 8
  • 84
  • 99
1

The following code pressuposes you want to consider a "match" when you have all the particles (pieces of strings separated by white space in the input) to be present in the result. I had used loops in the example, but of course you should adapt it to use raw_input.

Although it uses regex (to allow for multiple white space), the main feature used is the if particle in line:

import re

entry = """R Deep Transverse Metatarsal Ligament 4 GEODE
R Distal JointCapsule 1 GEODE
R Dorsal Calcaneocuboid Ligament GEODE
R Dorsal Carpometacarpal Ligament 2 GEODE
R Dorsal Cuboideavicular Ligament GEODE
R Dorsal Tarsometatarsal Ligament 5 GEODE
R Elbow Capsule GEODE
R F Distal JointCapsule 1 GEODE
R Fibular Collateral Bursa GEODE
R Fibular Collateral Ligament GEODE
R Fibular Ligament GEODE
"""

searches = """R De Me Li
Fi Colla
bow ODE"""

for search in searches.split('\n'):
    print search, ':'
    termlist = re.split('\s', search)
    for line in entry.split('\n'):
        match = True
        for term in termlist:
            if not term in line:
                match = False
        if match:
            print '\t', line
    print
heltonbiker
  • 26,657
  • 28
  • 137
  • 252