I'm developing a search engine for a CCG. I want the user to be able to find cards based on a query like, "blue brigade hero enhancements that can discard ec's"
or "purple kings of israel"
. There are many variables to search for: brigades (purple, blue), types (heroes, evil characters [ec's]), special abilities (discard), and identifiers (kings of israel). I'm thinking about regexing to find common search parameters. I know this won't be easy, and it will take a long time to fine tune, but can someone point me in the right direction? Is regex even a recommend solution? I don't know if it's important, but I'm using php and mysql.

- 7,953
- 19
- 62
- 119
-
2You might consider taking a look at [full-text searching in MySql](http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html), just to get a feel for some of your other options. – Michael Fredrickson Mar 27 '12 at 06:26
-
Full text wont work for just using the whole string. Each of those variable types i explained each have their own table. – LordZardeck Mar 27 '12 at 06:35
3 Answers
You'll have to write a parser to parse such query strings.
Regular expressions will be useful to find 'verbs' and 'nouns' in query strings, but you probably will also need a non-context grammar describing the language of your queries, for example something like this:
<QUERY> := <TARGET_SPEC>
<TARGET_SPEC> := <OBJECT> 'that can' <ABILITY>
<TARGET_SPEC> := <OBJECT>
<OBJECT> := <COLOR> <WHAT>
<OBJECT> := <WHAT>
<COLOR> := 'blue' | 'red' | 'purple' | 'green'
<WHAT> := <ITEM> | <HERO>
<ITEM> := <ADJECTIVE> <ITEM>
<ADJECTIVE> := 'brigade' | 'hero' | 'magic' | 'enhanced' | 'rustproof'
<ITEM> := 'enhancements' | 'sword' | 'potion'
<HERO> := <HERO> 'of' <COUNTRY>
<HERO> := 'kings' | 'knights' | 'thiefs'
<COUNTRY> := 'israel' | 'palestine' | 'jordan' | 'egypt'
<ABILITY> := <ABILITY> 'and' <ABILITY>
<ABILITY> := 'swim' | 'dance' | discard <DISCARDABLE> | 'kill' <HERO> | 'use' <ITEM>
<DISCARDABLE> := 'ec's' | 'et's' | 'etc'
Parser built around such a grammar will be able to determine which part of your query is an object, which is an ability, color, country etc. For example, given input string 'red knights of jordan that can swim', parser will choose right rules and apply them:
<QUERY> := 'red knights of jordan that can swim'
<TARGET_SPEC> := 'red knights of jordan that can swim'
<TARGET_SPEC> := 'red knights of jordan' 'that can' 'swim'
<OBJECT> := 'red knights of jordan'
<ABILITY> := 'swim'
<COLOR> := 'red'
<WHAT> := 'knights of jordan'
<HERO> := 'knights' 'of' 'jordan'
<HERO> := 'knights'
<COUNTRY> := 'jordan'
Basing on extracted information you'll be able to create search criteria.
Using a grammar has an additional benefit of resolving some ambiguities that would be hard to resolve any other way - for example, if user asks for 'red kings that can kill white knights', simple algorithms that just look for color by matching every word with list of colors available will fail.
I recommend reading a book on compiler design - Dragon Book is a classic choice (you don't have to read all of it, just the part about lexers and parsers).
If you don't want to code the entire parser yourself (as this can be pretty time-consuming and error-prone), you'll need a parser generator (that is, a program that creates parser source code for given grammar); here is a question with some suggestions for PHP.
You should also consider reading up on Natural Language Processing techniques. There is an online course from Stanford University here, I'm 'attending' it now and can wholeheartedly recommend it.
-
could you explain how I could use a programming language parser to parse questions like the one in my question? Programming languages rely on "punctuation" (curly braces and semi-colons) and select key words (while, for, if) to break apart text. I don't see how I can use that with questions which have none of these. – LordZardeck Mar 29 '12 at 07:52
-
Your queries will also have some punctation - words like 'that can', 'of', etc. I will expand my answer with example grammar. – socha23 Mar 29 '12 at 07:56
-
I've already accepted as an answer and awarded the bounty, but could you possibly explain how I might do this without having to dump the entire brigade, effect, and identifier tables into this one file? There are over 100 different identifiers and plenty of effects. – LordZardeck Mar 30 '12 at 17:04
-
Particulars would depend on the implementation details of your parser (I've never written one in PHP), but the general idea is to modify the rule for
, – socha23 Mar 30 '12 at 19:30or other 'basic' parts of your grammar not to match input by regex / enumerated strings, but rather by looking for it in an array containing all valid inputs for that category. Perhaps the best way to tackle that problem would be to generate a parser matching just a few identifiers / abilities, and then modify it by hand.
I really like socha's suggestion, but I would consider a much simpler one as well.
If you have a dictionary of known search terms and the ability to correct them for syntax and grammar (hint: use your database and OED as a cache layer, throwing any cache misses at Google), you can perform the search by binary bucket sorting each term into sets of known types. Using your example, each bucket would be: brigade_purple, brigade_blue, type_hero, type_evil, each of your special abilities, and each of your special type identifiers.
For each card, construct a bitfield conforming to your buckets. For each user query, construct the same. Then, return the results that conform to your bitmask, by performing a bitwise traversal of the database, which I assume for this toy example to be shaped like a B+ tree, sorting by results closest to the mask in major-bit order. This has the benefit of being extensible up to the maximum length of your backing bitfield, which can be practically unbounded in many database implementations.
Okay, so that's a bit technical. It's how I would construct the search database, in any event.
WITH TierTempCur As
--/*Use Rela table to get the offspring of the parent*/
(
SELECT Rela.ID_RSSD_PARENT
, Rela.ID_RSSD_OFFSPRING
, '12/31/2011' AS REPORT_DATE
, 1 As TREE_LVL
, CHECKSUM(ID_RSSD_PARENT, ID_RSSD_OFFSPRING) As CHKSUM
, RIGHT('000000000'+ CONVERT(VARCHAR(MAX),ID_RSSD_OFFSPRING),9) AS RSSD_PATH
FROM CUV_RELATIONSHIPS As Rela
WHERE ID_RSSD_PARENT = 451965 AND '12/31/2011' BETWEEN D_DT_START AND D_DT_END
AND Rela.CTRL_IND = 1 --/* indicates subsidiary */
AND Rela.OTHER_BASIS_IND not in (3,8) --/* Per DM's job */
UNION ALL
SELECT Rela.ID_RSSD_PARENT
, Rela.ID_RSSD_OFFSPRING
, REPORT_DATE
, TREE_LVL + 1 As TREE_LVL
, CHECKSUM(Rela.ID_RSSD_PARENT, Rela.ID_RSSD_OFFSPRING) As CHKSUM
, Tmp.RSSD_PATH + '\' + RIGHT('000000000'+ CONVERT(VARCHAR(MAX),Rela.ID_RSSD_OFFSPRING),9) AS RSSD_PATH
FROM CUV_RELATIONSHIPS As Rela
INNER JOIN TierTempCur As Tmp
ON Rela.ID_RSSD_PARENT = Tmp.ID_RSSD_OFFSPRING
AND REPORT_DATE BETWEEN Rela.D_DT_START AND Rela.D_DT_END
WHERE TREE_LVL < 20 --/*max depth for the tier is 20 -- to end self referencing parent/child relationships */
AND Rela.CTRL_IND = 1 --/* indicates subsidiary */
AND Rela.OTHER_BASIS_IND not in (3,8)
),