3

Is there a way to select rows where one of the column contains only , but any number of, predefined values?

I've been using this, but it returns any rows where my column contains at least one of the values (which is exactly what it's suppose to do, I know).

But I'm looking for a way to only select rows that have ONLY my keywords in the keyword column.

SELECT * 
FROM 
    `products`.`product` 
WHERE 
    keywords LIKE '%chocolate%' 
AND keyword LIKE '%vanilla%';

Example Keywords: chocolate, sugar, milk, oats

Using the above keywords, I would want the first two results returned, but not the last two:

Product1: chocolate, sugar 

Product2: chocolate 

Product3: chocolate, sugar, milk, oats, bran 

Product4: chocolate, sugar, salt

My column contains a comma separated list of all keywords applicable to that product row.

ericdmann
  • 357
  • 5
  • 20

2 Answers2

3

Since you are storing the list as a string containing a comma separated list, rather than as a set, MySQL isn't going to be able to help much with that. When it was inserted into the database, MySQL saw it as a single string. When it's retrieved from the database, MySQL sees it as a single string. When we refer to it in a query, MySQL sees it as a single string.


If the "list" was stored as a standard relational set, with each keyword for a product stored as a separate row in the table, then returning the result set you specified is almost trivial.

For example, if we had this table:

CREATE TABLE product_keyword 
product_id      BIGINT UNSIGNED COMMENT 'FK ref products.id'
keyword         VARCHAR(20)

With each keyword associated to a particular product as a separate row:

product_id keyword
---------- ---------
         1 chocolate
         1 sugar
         2 chocolate
         3 bran
         3 chocolate
         3 milk
         3 oats
         3 sugar
         4 chocolate
         4 salt
         4 sugar

Then to find all rows in product that have a keyword other than 'chocolate' or 'vanilla'

SELECT p.id
  FROM product p
  JOIN product_keyword k
 WHERE k.product_id = p.id
    ON k.keyword NOT IN ('chocolate','vanilla')
 GROUP BY p.id

--or--

SELECT p.id
  FROM product p
  LEFT
  JOIN ( SELECT j.id
           FROM product_keyword j
          WHERE j.keyword NOT IN ('chocolate','vanilla')
         GROUP BY j.id
       ) k
    ON k.id = p.id 
 WHERE k.id IS NULL

To get products that have at least one of the keywords 'chocolate' and 'vanilla', but that have no other keywords associated, it's the same query above, but with an additional join:

SELECT p.id
  FROM product p
  JOIN ( SELECT g.id
           FROM product_keyword g
          WHERE g.keyword IN ('chocolate','vanilla')
         GROUP BY g.id
       ) h
    ON h.id = p.id 
  LEFT
  JOIN ( SELECT j.id
           FROM product_keyword j
          WHERE j.keyword NOT IN ('chocolate','vanilla')
         GROUP BY j.id
       ) k
    ON k.id = p.id 
 WHERE k.id IS NULL

We can unpack those queries, they aren't difficult. Query h returns a list of product_id that have at least one of the keywords, query k returns a list of product_id that have some keyword other than those specified. The "trick" there (if you want to call it that) is the anti-join pattern... doing an outer join to match rows, and include rows that didn't have a match, and a predicate in the WHERE clause that eliminates rows that had a match, leaving the set of rows from product that didn't have a match.


But with the set stored as a "comma separated list" in a single character column, we lose all the advantages of relational algebra; there isn't any easy way to process the list of keywords as a "set".

With the entire list stored as a single string, we've got some horrendous SQL to get the specified result.

One approach to doing the kind of check you specify would be to create a set of all possible "matches", and check those. This is workable for a couple of keywords. For example, to get a list of products that have ONLY the keywords 'vanilla' and/or 'chocolate', (that is, that have at least one of those keywords and does not have any other keyword):

SELECT p.id
  FROM product 
 WHERE keyword_list = 'chocolate'
    OR keyword_list = 'vanilla'
    OR keyword_list = 'chocolate,vanilla'
    OR keyword_list = 'vanilla,chocolate'

But extending that to three, four or five keywords quickly becomes unwieldy (unless the keywords are guaranteed to appear in a particular order. And it's very difficult to check for three out of four keywords.

Another (ugly) approach is to transform the keyword_list into a set, so that we can use queries like the first ones in my answer. But the SQL to do the transformation is limited by an arbitrary maximum number of keywords that can be extracted from the keyword_list.

It's fairly easy to extract the nth element from a comma separated list, using some simple SQL string functions, for example, to extract the first five elements from a comma separated list:

SET @l := 'chocolate,sugar,bran,oats'
SELECT NULLIF(SUBSTRING_INDEX(CONCAT(@l,','),',',1),'')                         AS kw1
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(@l,','),',',2),',',-1),'') AS kw2
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(@l,','),',',3),',',-1),'') AS kw3
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(@l,','),',',4),',',-1),'') AS kw4
     , NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(@l,','),',',5),',',-1),'') AS kw5

But those are still on the same row. If we want to do checks on those, we'd have a bit of comparing to do, we'd need to check each one of those to see if it was in the specified list.

If we can get those keywords, on the one row, transformed into a set of rows with one keyword on each row, then we could use queries like the first ones in my answer. As an example:

SELECT t.product_id
     , NULLIF(CASE n.i
       WHEN 1 THEN SUBSTRING_INDEX(CONCAT(t.l,','),',',1)
       WHEN 2 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(t.l,','),',',2),',',-1)
       WHEN 3 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(t.l,','),',',3),',',-1)
       WHEN 4 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(t.l,','),',',4),',',-1)
       WHEN 5 THEN SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(t.l,','),',',5),',',-1)
       END,'') AS kw
  FROM ( SELECT 4 AS product_id,'fee,fi,fo,fum' AS l  
          UNION ALL 
         SELECT 5, 'coffee,sugar,milk'
        ) t
 CROSS
  JOIN ( SELECT 1 AS i
         UNION ALL SELECT 2
         UNION ALL SELECT 3
         UNION ALL SELECT 4
         UNION ALL SELECT 5
       ) n
HAVING kw IS NOT NULL
ORDER BY t.product_id, n.i

That gets us individual rows, but it's limited to a row for each of the first 5 keywords. It's easy to see how that would be extended (having n return 6,7,8,...) and extending the WHEN conditions in the CASE to handle 6,7,8...

But there is going to be some arbitrary limit. (I've used an inline view, aliased as t, to return two "example" rows, as a demonstration. That inline view could be replaced with a reference to the table containing the product_id and keyword_list columns.)

So, that query gets us a rowset like would be returned from the product_keyword table I gave as an example above.

In the example queries, references to the product_keyword table could be replaced with this query. But that's a whole lot of ugly SQL, and its horrendously inefficient, creating and populating temporary MyISAM tables anytime a query is run.

spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • This is perfect! I can easily recreate the table with the correct structure to make this work. Thank you! I was unsure how to store the keywords when I created the table, but I should just be able to create them correctly to make it work best. – ericdmann Apr 03 '14 at 23:28
  • @loopifnil: Just to be clear, I wasn't referring to the MySQL `"SET"` datatype. By "set" I was referring just to a "set of rows" in a table, with each row representing one keyword for a product. That's as opposed to a single row containing a string. (And not that the `"SET"` datatype doesn't have some performance benefits, it does, but it's limited to a static list of valid values; and it has the shortcomings as the comma separated list in a string does, in terms of processing as rows. – spencer7593 Apr 04 '14 at 00:02
1

You probably want to setup a fulltext index on keywords for your table. This allows you to search the keywords column and specify what keywords to include or not include. Here's a command which sets up the index:

ALTER TABLE products ADD FULLTEXT index_products_keywords (keywords);

Once you've done that, you can use the MATCH AGAINST phrase and specify keywords. You can use it like WHERE MATCH(keywords) AGAINST ('chocolate') to just search for the term chocolate. Or you can use BOOLEAN MODE to "turn-off" certain keywords.

SELECT * FROM products 
WHERE MATCH(keywords) AGAINST ('+chocolate -bran' IN BOOLEAN MODE);

Here's a small tutorial about fulltext indexes

Leroy
  • 544
  • 3
  • 14
  • Would there be a way to turn off all keywords but the few inputted by the user? My keyword database is quite large. – ericdmann Apr 03 '14 at 20:33
  • I don't think it works like that. If you somehow turned off all keywords, then searched for `chocolate` only a row with a single keyword of `chocolate` would show up. – Leroy Apr 03 '14 at 20:40
  • That's what I would want to happen. I want to have the user provide a list of keywords, then only return products that ONLY contain keywords from the list provided, but any number of the keywords. – ericdmann Apr 03 '14 at 20:44
  • In boolean mode, if you searched like `+chocolate +sugar +milk +oats` then it would only find the keywords which contained each of the words: chocolate, sugar, milk, and oats. The more specific the user is with the search, the more narrow the results will be. – Leroy Apr 03 '14 at 20:51
  • If a user searches for "Chocolate". I only want to return the products where there is only one keyword, and it's chocolate. If they were to search for chocolate, sugar, and milk. I would want it to return results that contained only chocolate, sugar, and milk -> OR any combination of keywords. I'm doing my best to explain the question, sorry for the confusion! – ericdmann Apr 03 '14 at 20:57
  • I don't know of a way to do what you want with your setup. The best shot I have is using `SELECT 'chocolate' REGEXP '^chocolate$'` which just checks to see if the keywords begins and ends with `chocolate`. You could search for specific phrases, i.e "chocolate, sugar, milk, oats", but you would only find products with that exact phrase -- those keywords in that exact order. – Leroy Apr 03 '14 at 21:09
  • I'm very willing to reorganize my table to make it happen. Did you have another setup in mind? – ericdmann Apr 03 '14 at 21:23
  • Maybe a better way to rephrase it is, how do I return rows that only have certain keywords. – ericdmann Apr 03 '14 at 21:27