1

I have a table which has two columns: ID (primary key, auto increment) and keyword (text, full-text index). The values entered in the keyword column include the following: keyword

  • Car
  • Car sales
  • Cars
  • Sports cars
  • Sports foo
  • Car bar
  • Statistics

Suppose that we have this sentence as an input:

"Find sports car sales statistics in Manhattan."

I'm looking (and I have been searching for quite a while) to find either a MySQL query or an algorithm which takes in the given input, and detects the keywords used from the keywords column, resulting in an output of:

"Sports cars", "Car sales", "Statistics"

In other words, I'm trying to take an input that's in the form of a sentence, and then match all the existing (and most relevant) keyword values in the database that are found in the sentence. Note that these keywords could be phrases that consist of words separated by a space.

After researching I got to know that MySQL does a similar job through its full-text search feature. I have tried all the natural language, boolean, and query expansion options, but they include keyword records that only have half of its contents matching with the input. For example, it outputs:

"Car", "Car sales", "Sports cars", "Sports foo", "Cars bar", "Statistics".

I don't want this to happen because it includes words that aren't even in the input (i.e. foo and bar).

Here's the MySQL query for the above mentioned search:

SELECT * FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)

I also tried to improve on the relevancy, but this one only returns a single record:

SELECT *, SUM(MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE)) as score FROM tags WHERE MATCH(keyword) AGAINST('Find sports car sales statistics in Manhattan.' IN BOOLEAN MODE) ORDER BY score DESC
  • Can you add a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) to your question?it can make your question more clear ? and you'll get a better answer! – Mazdak May 16 '15 at 13:52
  • I know I've written a bit too much, but I guess it's needed to convey the question. – Mohammad Tomaraei May 16 '15 at 13:53
  • 1
    As you may know, questions like this are off topic in SO,although you have explained nice you need to add your code that you have tried so far! – Mazdak May 16 '15 at 13:55
  • MySQL returns `Car` as well, which is OK to get, but since there is another word `Sales` that matches a combination, I would like it to be given a lower priority. It doesn't matter that much though. – Mohammad Tomaraei May 16 '15 at 14:19
  • And are you sure that you want `Sports cars` while you just have `Sports car` in DB? – Mazdak May 16 '15 at 14:19
  • Exactly, that's one of my points. I thought of using the Levenshtein Distance formula to check for similar words (i.e. plural, singular), but I thought MySQL does that in a more efficient way (even though it gives me extra stuff). – Mohammad Tomaraei May 16 '15 at 14:22
  • Yes Levenshtein is one way but how MySQL does that in a more efficient way? – Mazdak May 16 '15 at 14:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77967/discussion-between-mohammad-reza-and-kasra). – Mohammad Tomaraei May 16 '15 at 14:24

1 Answers1

1

If we suppose that you have your column in a list as a pythonic way for such tasks you can use set.intersection to get the intersection between two set (the second element could be another iterables like list or tuple) :

>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'}
>>> col={i.lower() for i in col}
>>> s="Find sports car sales statistics in Manhattan."
>>> col.intersection(s.strip('.').split())
set(['car', 'statistics'])

And in your case you can put the result of your query within a set or convert it to set.

Note : the following set comprehension will convert the elements if your column to lower case :

>>> col={i.lower() for i in col}

But this recipe will find the intersection between your column and the splitted string with white spaces. so the result will be :

set(['car', 'statistics'])

As another way you can use re.search :

>>> col={'Car','Car sales','Cars','Sports cars','Sports foo','Car bar','Statistics'} 
>>> s='Find sports car sales statistics in Manhattan.'
>>> for i in col:
...    g=re.search('{}'.format(i),s,re.IGNORECASE)
...    if g:
...      print g.group(0)
... 
statistics
car sales
car

As a simple way you can use a function like following to get a combinations of your phrases :

from itertools import permutations
def combs(phrase):
    sp=phrase.split()
    com1=[map(lambda x:' '.join(x),li) for li in [permutations(sp,j) for j in range(1,len(sp)+1)]]
    for i,k in enumerate(sp):
          if not k.endswith('s'):
             sp[i]=k+'s'
    com2=[map(lambda x:' '.join(x),li) for li in [permutations(sp,j) for j in range(1,len(sp)+1)]]
    return com1+com2

print {j for i in combs('Car sales') for j in i}
set(['Car', 'sales', 'sales Cars', 'Car sales', 'Cars sales', 'sales Car', 'Cars'])

Note that this function could be more efficient and complete.

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • These two approaches are absolutely good points to consider, but considering it doesn't recognize the word "Sports" even though it was used, I guess it must do more processing. – Mohammad Tomaraei May 16 '15 at 14:23
  • Thanks for all the approaches, they're really helpful. I actually found out an answer that is quite close to what I'm looking for (a good start point): http://stackoverflow.com/questions/5859561/getting-the-closest-string-match MySQL also implements a similar MATCH algorithm that I have to look at. – Mohammad Tomaraei May 16 '15 at 15:13
  • @MohammadReza Glad about hearing that! i think its more efficient that use `MySQL` for this task! – Mazdak May 16 '15 at 15:24