12

I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in Whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot achieve a match on the above:

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Index every test item ("item" was previously undefined)
for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print results

gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?

Assem
ciferkey

5 Answers

11

You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between Co and Company is large. Co is as similar to Company as Be is to Beast or ny is to Company, so you can imagine how large and how poor the search results would be.
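To make the size of that difference concrete, here is a minimal sketch that counts edits between the word pairs mentioned in this answer. It assumes that maxdist counts plain Levenshtein edits (insert, delete, substitute); Whoosh's exact metric may differ, and `edit_distance` is a hypothetical helper, not part of Whoosh.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


pairs = [("co", "company"), ("be", "beast"), ("ny", "company"),
         ("compan", "company"), ("compani", "company"), ("companee", "company")]
for left, right in pairs:
    print("%-8s -> %-7s distance %d" % (left, right, edit_distance(left, right)))
```

Under that assumption, Co is 5 edits away from Company, while the misspellings Compan, compani and Companee are only 1 or 2 edits away.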

However, if you want to match Compan, compani, or Companee to Company, you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist – The maximum edit distance from the given text.

class MyFuzzyTerm(FuzzyTerm):
     def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
         super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

Then:

 qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)

You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist between 1 and 3.

If you want to match a word's linguistic variations, you are better off using whoosh.query.Variations.

Note: older Whoosh versions have minsimilarity instead of maxdist.

Assem
3

For future reference: there must be a better way to do this, but here's my shot.

# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")

writer = idx.writer()

writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")

writer.commit()

s = idx.searcher()
print "Fields: ", list(s.lexicon("name"))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)

for i in range(1,40):
    res = s.search(FuzzyTerm("name", "just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print "Potential match ( %s ): [  %s  ]" % ( i, r["name"] )
        break
    else:
        print "Pass: %s" % i

s.close()
trokster
1

Perhaps some of this stuff might help (string matching open sourced by the seatgeek guys):

https://github.com/seatgeek/fuzzywuzzy
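As a rough illustration of whole-string similarity scoring, here is a sketch using only the standard library's difflib. It stands in for fuzzywuzzy rather than calling it, so the scores will not be identical to that library's. The `similarity` helper and the second candidate name are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Eagle Bank & Trust Co of Missouri"
candidates = [
    "Eagle Bank and Trust Company of Missouri",
    "First National Bank of Omaha",  # hypothetical non-matching name
]

# Rank candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda name: similarity(query, name), reverse=True)
for name in ranked:
    print("%.2f  %s" % (similarity(query, name), name))
```

Scoring the full name this way tolerates "&" versus "and" and "Co" versus "Company" because the rest of the string still lines up, at the cost of comparing the query against every candidate.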

malangi
0

For anyone stumbling across this question more recently, it looks like they've added fuzzy support natively, though it'd take a bit of work to satisfy the particular use case outlined here: https://whoosh.readthedocs.io/en/latest/parsing.html

-3

You could use the function below to check the words of a phrase against a text. Note that it tests for exact substrings rather than approximate matches:

def FuzzySearch(text, phrase):
    """Report each word of phrase that occurs as a substring of text."""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
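The function above only finds exact substrings. A tolerant variant could compare individual words with the standard library's `difflib.get_close_matches`. This is a sketch, not this answer's original method; the name `fuzzy_word_matches` and the 0.8 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def fuzzy_word_matches(text, phrase, cutoff=0.8):
    """Map each word of phrase to the words of text that closely resemble it."""
    text_words = text.lower().split()
    matches = {}
    for word in phrase.lower().split():
        close = get_close_matches(word, text_words, n=3, cutoff=cutoff)
        if close:
            matches[word] = close
    return matches


found = fuzzy_word_matches("Eagle Bank and Trust Company of Missouri", "compani co")
print(found)
```

With this cutoff the misspelling "compani" is paired with "company", while the abbreviation "co" is not, which mirrors the earlier point that Co and Company are too far apart for fuzzy matching.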
Hazim Sager