4

How can I match variations on words in MySQL, for example a search for accountancy should match accountant, accountants, accounting etc. I'm on shared hosting so can't add any functions to MySQL such as levenshtein.

I want something similar to how Google matches 'accounting course' and 'accountancy courses' when searching for 'accountant courses'. Example.

My server language is php, if it's only possible to implement it there and not in SQL.

The current statement is as follows.

SELECT 
  pjs.title,
  MATCH (pjs.title) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel1,
  MATCH (pjs.description) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel2,
  MATCH (
    pjs.benefits,
    pjs.experienceRequirements,
    pjs.incentives,
    pjs.qualifications,
    pjs.responsibilities,
    pjs.skills
  ) AGAINST ('accountancy' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION) AS rel3
FROM
  pxl_jobsearch AS pjs 
ORDER BY (rel1 * 5) + (rel2 * 1.5) + (rel3) DESC;
Ric
  • 3,195
  • 1
  • 33
  • 51

5 Answers5

3

MySQL isn't very good at full text search and you'd probably want to use other engines. My favorite one is Sphinx (http://sphinxsearch.com/) but there are others as well. Most of these support stemming out of the box.

If you have large tables and are going to use stemming the performance of MySQL will probably be very bad.

If you can't use Sphinx, take a look at this php script http://tartarus.org/~martin/PorterStemmer/php.txt

With this you can use stemming, and the search on the stemmed words.

Nin
  • 2,960
  • 21
  • 30
  • thanks. I'd love to use Sphinx, looked briefly into it before but stuck on shared hosting so no can do this time. I don't see data size being a problem with this site. Hopefully, if it takes off I will be able to improve the hosting in the future. – Ric Sep 11 '12 at 11:30
  • Can you install a PECL package: http://pecl.php.net/package/stem Or can you check if it is installed? Never mind, you need snowball as well. – Nin Sep 11 '12 at 11:52
  • Added a php stemming script, you should be able to use that in a shared environment. Do note that it's probably not perfect. – Nin Sep 11 '12 at 12:04
  • Thanks @Nin I awarded you the bounty as Sphinx (or similar) would be the best answer in ideal circumstances, but the link to the stemmer is a good fallback. – Ric Sep 13 '12 at 08:31
  • +1 for the link to the PHP stemmer - made my life a whole lot easier – Dave Hilditch Feb 07 '14 at 04:04
1

The SOUNDEX() function of MySQL comes pretty close. Read more about it here.

Example:

create table test(id int auto_increment, a varchar(255), primary key(id));
insert into test(a) values
('accountancy'),
('accountant'),
('accountants'),
('accounting'),
('accountingc'),
('becounting'),
('asdf'),
('this is a test');

select 
test.*,
SOUNDEX(a),
SOUNDEX('accountancy')
FROM
test 
WHERE a SOUNDS LIKE 'accountancy';

If this doesn't solve it, the levenshtein algorithm is the way to go. Talk to your database administrator that he allows you to create functions. If he does, here's the solution (I didn't write the function, credit goes to anonymous):

DELIMITER //
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END//

Again test data:

create table leven(id int auto_increment, a varchar(255), primary key(id));
insert into leven(a) values
('accountancy'),
('accountant'),
('accountants'),
('accounting'),
('accountingc'),
('becounting'),
('asdf'),
('this is a test')
;


select
leven.*,
levenshtein(leven.a, 'accountancy')
from
leven
where levenshtein(leven.a, 'accountancy') <= 3 /*or any value you like*/
fancyPants
  • 50,732
  • 33
  • 89
  • 96
  • @tomdom I tried SOUNDEX but unfortunately it didn't help me much in this case, I do like it for searching people's name though. The levenshtein function sounds interesting but I can't us it on the shared hosting. I think I'd need a few more rules too, maybe just changing the end of the word - the suffixes. – Ric Sep 10 '12 at 09:47
1

Search engines do this by implementing a text processing technique known as stemming. There are many libraries that implement this for you, personally I have user the Snowball stemmer, it does a good enough job.

I'm not familiar enough with the full text search capabilities of MySql, but you could try to apply the stemming algorithms to the search terms. For your example of "accountancy courses" the Snowball stemmer returns "account cours".

javcek
  • 696
  • 6
  • 3
  • I've had some success with stemming. Thanks for the technique. It's helped me to research further. This is of interest: http://stackoverflow.com/questions/190775/stemming-algorithm-that-produces-real-words – Ric Sep 10 '12 at 09:49
0

I don't know much about MATCH, when I want to select a column with variations I do the following

SELECT pjs.title
FROM pxl_jobsearch AS pjs
WHERE pjs.title LIKE 'account%'

I work mostly in SQL Server but do some MySQL. I imagine that this works in MySQL as well.

Malachi
  • 3,205
  • 4
  • 29
  • 46
  • I thought about using like, but my problem is if someone types in 'accountancy' as their search I can't use a wild card to match 'accountant' (accountancy%). Perhaps I could make some assumptions and always remove the last 4 characters so it becomes 'account%', but a shorter word like 'trust' wouldn't work. I want something similar to how Google matches 'accounting course' and 'accountancy courses' when searching for 'accountant courses'. – Ric Aug 23 '12 at 15:26
  • I see what you are saying. that is a good question. please Edit your Question to specify this. – Malachi Aug 23 '12 at 15:30
0

You may use SQL SOUNDEX(), which is pretty useful for your needs: it searches for words that sound the same, not for grammatically close ones. You may use two approaches, which are quite similar.

  • Recognize common suffixes and replace them with a % in the LIKE clause: using your example, accountancy would become account%.
  • Write a "dictionary" that contains all the variations you need, and use it into PHP (stristr() function): so, accountancy would produce a clause similar to WHERE value='accountancy' or value='accountant' or value='accountancies'.
Giulio Muscarello
  • 1,312
  • 2
  • 12
  • 33