6

Right now I am planning to add a filter system to my site.

Examples:

(ID=apple, COLOR=red, TASTE=sweet, ORIGIN=US)
(ID=mango, COLOR=yellow, TASTE=sweet, ORIGIN=MEXICO)
(ID=banana, COLOR=yellow, TASTE=bitter-sweet, ORIGIN=US)

so now I am interested in doing the following: SELECT ID FROM thisTable WHERE COLOR='yellow' AND TASTE='SWEET'

But my problem is I am doing this for multiple categories in my site, and the columns are NOT consistent. (like if the table is for handphones, then it will be BRAND, 3G-ENABLED, PRICE, COLOR, WAVELENGTH, etc)

how could I design a general schema that allows this?

Right now I am planning on doing:

table(ID, KEY, VALUE)

This allows arbitary number of columns, but for the query, I am using SELECT ID FROM table WHERE (KEY=X1 AND VALUE=V1) AND (KEY=X2 AND VALUE=V2), .. which returns an empty set.

Can someone recommend a good solution to this? Note that the number of columns WILL change regularly

crapbag
  • 89
  • 1
  • 4
  • 1
    It is to my surprise that Reddit uses EVA extensively. http://carsonified.com/blog/dev/steve-huffman-on-lessons-learned-at-reddit/ – crapbag May 10 '10 at 09:32

4 Answers4

7

The entity-attribute-value model that you suggest could fit in this scenario.

Regarding the filtering query, you have to understand that with the EAV model you will sacrifice plenty of query power, so this can become quite tricky. However this one way to tackle your problem:

SELECT    stuff.id 
FROM      stuff 
JOIN      (SELECT    COUNT(*) matches
           FROM      table
           WHERE     (`key` = X1 AND `value` = V1) OR 
                     (`key` = X2 AND `value` = V2) 
           GROUP BY  id
          ) sub_t ON (sub_t.matches = 2 AND sub_t.id = stuff.id)
GROUP BY  stuff.id;

One inelegant feature of this approach is that you need to specify the number of attribute/value pairs that you expect to match in sub_t.matches = 2. If we had three conditions we would have had to specify sub_t.matches = 3, and so on.

Let's build a test case:

CREATE TABLE stuff (`id` varchar(20), `key` varchar(20), `value` varchar(20));

INSERT INTO stuff VALUES ('apple',  'color',  'red');
INSERT INTO stuff VALUES ('mango',  'color',  'yellow');
INSERT INTO stuff VALUES ('banana', 'color',  'yellow');

INSERT INTO stuff VALUES ('apple',  'taste',  'sweet');
INSERT INTO stuff VALUES ('mango',  'taste',  'sweet');
INSERT INTO stuff VALUES ('banana', 'taste',  'bitter-sweet');

INSERT INTO stuff VALUES ('apple',  'origin',  'US');
INSERT INTO stuff VALUES ('mango',  'origin',  'MEXICO');
INSERT INTO stuff VALUES ('banana', 'origin',  'US');

Query:

SELECT    stuff.id 
FROM      stuff 
JOIN      (SELECT    COUNT(*) matches, id
           FROM      stuff
           WHERE     (`key` = 'color' AND `value` = 'yellow') OR 
                     (`key` = 'taste' AND `value` = 'sweet')
           GROUP BY  id
          ) sub_t ON (sub_t.matches = 2 AND sub_t.id = stuff.id)
GROUP BY  stuff.id;

Result:

+-------+
| id    |
+-------+
| mango |
+-------+
1 row in set (0.02 sec)

Now let's insert another fruit with color=yellow and taste=sweet:

INSERT INTO stuff VALUES ('pear', 'color', 'yellow');
INSERT INTO stuff VALUES ('pear', 'taste', 'sweet');
INSERT INTO stuff VALUES ('pear', 'origin', 'somewhere');

The same query would return:

+-------+
| id    |
+-------+
| mango |
| pear  |
+-------+
2 rows in set (0.00 sec)

If we want to restrict this result to entities with origin=MEXICO, we would have to add another OR condition and check for sub_t.matches = 3 instead of 2.

SELECT    stuff.id 
FROM      stuff 
JOIN      (SELECT    COUNT(*) matches, id
           FROM      stuff
           WHERE     (`key` = 'color' AND `value` = 'yellow') OR 
                     (`key` = 'taste' AND `value` = 'sweet') OR 
                     (`key` = 'origin' AND `value` = 'MEXICO')
           GROUP BY  id
          ) sub_t ON (sub_t.matches = 3 AND sub_t.id = stuff.id)
GROUP BY  stuff.id;

Result:

+-------+
| id    |
+-------+
| mango |
+-------+
1 row in set (0.00 sec)

As in every approach, there are certain advantages and disadvantages when using the EAV model. Make sure you research the topic extensively in the context of your application. You may even want to consider an alternative relational databases, such as Cassandra, CouchDB, MongoDB, Voldemort, HBase, SimpleDB or other key-value stores.

Daniel Vassallo
  • 337,827
  • 72
  • 505
  • 443
  • Wow this looks pretty complicated. Thanks for the solution. There are a lot of people opposing to me using this design, so for now I am seriously considering whether I should use the EVA model – crapbag May 04 '10 at 11:22
  • @sadvaw: The opposition mainly stems from the fact that when you use the EAV model in a relational database, it is like using a van to carry yourself around the city: therefore you're not using it for what it was built to do. However it can still be done, and the feasibility of such often depends on the scale (how much you do it, or how big). Therefore I would say that if all you are doing in the database is this, then I would actually consider alternatives to an RDBMS. However if you have a larger database and this is just a small part, then these considerations could be less important. – Daniel Vassallo May 04 '10 at 15:40
1

The following worked for me:

SELECT * FROM mytable t WHERE 
    t.key = "key" AND t.value = "value" OR
    t.key = "key" AND t.value = "value" OR
    ....
    t.key = "key" AND t.value = "value"
GROUP BY t.id having count(*)=3;

count(*)=3 must match the amount of

t.key = "key" AND t.value = "value"

cases

Tarlog
  • 10,024
  • 2
  • 43
  • 67
0

What you are suggesting is known as an Entity-Attribute-Value structure and is highly discouraged. One of the (many) big problems with EAV designs for example is in data integrity. How you do enforce that colors only consist of "red", "yellow", "blue" etc? In short, you can't without a lot of hacks. Another problem rears itself in querying (as you have seen) and searching for data.

Instead, I would recommend creating a table that represents each type of entity and thus each table can have attributes (columns) that are specific to that type of entity.

In order to convert the data into columns in a result query as you are seeking, you will need to create what is often called a crosstab query. There are report engines that will do it and you can do it code but most database products will not do it natively (meaning without building the SQL string manually). The performance of course will not be good if you have a lot of data and you will run into problems filtering on the data. For example, suppose that some of the values are supposed to be numeric. Because the value part of the EAV is likely to be a string, you will have to cast those values to an integer before you can filter on them and that presumes that the data will be convertible to an integer.

Thomas
  • 63,911
  • 12
  • 95
  • 141
0

The price you pay for simplistic table design at this stage will cost you in terms of performance in the long run. Using ORM to reduce the cost of modifying the database to fit data in an appropriate structure would probably be a good time investment, even in spite of ORM's performance cost.

Otherwise, you may want to look for a "reverse ORM" that maps the code from your database, which has the benefit of being less expensive and having higher performance. (Slightly higher starting cost compared to ORM, but better long-term performance and reliability.)

It's a costly problem regardless of how you slice it. Do you want to pay now with development time or pay later when your performance tanks? ("Pay later" is the wrong answer.)

cbednarski
  • 11,718
  • 4
  • 26
  • 33
  • Can you recommend a table design that fits your answer? I do not really understand what you are referring to. – crapbag May 04 '10 at 11:33
  • I came across the name for the theory I was alluding to: Anchor Modeling. The source is a bit academic: http://syslab.dsv.su.se/profiles/blogs/anchor-modeling so you may find this explanation a bit easier to digest: http://askmonty.org/wiki/Manual:Table_Elimination The (related but separate) point of procedural database translation (ORM or reverse ORM techniques) is to reduce the amount of code you have to write to access a more complex, specialized data structure that has superior performance, normalization, and relational characteristics. – cbednarski May 09 '10 at 20:06