Find similar entities in EAV model

Question

I need a query that returns "similar" products in my EAV table that:

1) share at least one similar attribute

2) have no attributes different to the product

eg

ProductID Attribute Value
1         Prop1      1
1         Prop2      2
2         Prop1      1
3         Prop1      1
3         Prop2      3

Say in this example, searching for products similar to product id 1 (Prop1:1 and Prop2:2). Product 2 would be returned, because Prop1 is 1, but product 3 is not ok, because Prop2 is different.etc.

There is a variable number of attributes for each product, so joining the table for each attribute is not an option. At the moment I'm concatenating the list of props to build a dynamic SQL "where", but I can't find a good (fast?) SQL statement that will do this.

Maybe I've spent too much time focusing on this problem, but I can't shake the feeling that I'm missing an obvious way to do this...

I don't understand your example. Why does productid=2 match and why does productid=3 not match? what exactly is your business rule? — BellevueBob, Dec 16 '12 at 18:39
@BobDuell: Where the attributes are in common, the values must be the same. Although ProductIDs 1 and 3 share Prop1 and have the same value, they also share Prop2 but the values are different, so 3 is rejected as not similar to 1. By contrast, ProductIDs 1 and 2 share Prop1 and have the same value, so they're OK. I think that if ProductID 4 had Prop1 with value 1 and also Prop3 with value 2, it would match ProductID 1. — Jonathan Leffler, Dec 16 '12 at 18:44
exactely what Jonathan said. I need to get products that have one or more of the props as the one I'm looking for. It's ok, if they have OTHER props, or missing props, as long as it does not have a prop I'm looking for but with a wrong value — Radu094, Dec 16 '12 at 18:50
This sort of thing gets messy. Take a look at [How to select similar sets in SQL](http://stackoverflow.com/questions/13132270/how-to-select-similar-sets-in-sql/), and try not to panic too much. — Jonathan Leffler, Dec 16 '12 at 19:03
This question and its answers illustrate very nicely the difficulties of querying against an EAV data model. The solution (and the work the database is required to do) would not be much easier if it were a simple search for duplicates! Perhaps readers who may be tempted by the "elegance" and "flexibility" of EAV may see this and reconsider. — Lord Peter, Dec 16 '12 at 21:16

score 3 · Accepted Answer · answered Dec 16 '12 at 20:47

When faced with this sort of problem, I use TDQD — Test-Driven Query Design.

Please note that it helps everyone if you give your table a name!

Pass 1

List of products with one or more attributes identical to product 1

SELECT a.ProductID, COUNT(*) AS matches
  FROM EAV_Table AS a
  JOIN EAV_Table AS b
    ON a.Attribute = b.Attribute AND a.value = b.value
 WHERE a.ProductID != 1
   AND b.ProductID  = 1
 GROUP BY a.ProductID

This clearly won't list any products where the count is 0, which is fine.

List of products with one or more attributes mismatching product 1

SELECT c.ProductID, COUNT(*) AS matches
  FROM EAV_Table AS c
  JOIN EAV_Table AS d
    ON c.Attribute = d.Attribute AND c.value != d.value
 WHERE c.ProductID != 1
   AND d.ProductID  = 1
 GROUP BY c.ProductID

This also won't list products where the count is 0, which is more of a nuisance.

Result — Pass 1

We need all the products from the first query where the product is not listed in the second query. That can be expressed by a NOT EXISTS and a correlated sub-query:

SELECT a.ProductID, COUNT(*) AS matches
  FROM EAV_Table AS a
  JOIN EAV_Table AS b
    ON a.Attribute = b.Attribute AND a.value = b.value
 WHERE a.ProductID != 1
   AND b.ProductID  = 1
   AND NOT EXISTS
       (SELECT c.ProductID
          FROM EAV_Table AS c
          JOIN EAV_Table AS d
            ON c.Attribute = d.Attribute AND c.value != d.value
         WHERE c.ProductID != 1
           AND d.ProductID  = 1
           AND c.ProductID = a.ProductID
       )
 GROUP BY a.ProductID

That's pretty ugly. It works, but it is ugly.

Test Data

CREATE TABLE eav_table
(
    productid INTEGER NOT NULL,
    attribute CHAR(5) NOT NULL,
    value INTEGER NOT NULL,
    PRIMARY KEY(productid, attribute, value)
);

INSERT INTO eav_table VALUES(1, "Prop1", 1);
INSERT INTO eav_table VALUES(1, "Prop2", 2);
INSERT INTO eav_table VALUES(2, "Prop1", 1);
INSERT INTO eav_table VALUES(3, "Prop1", 1);
INSERT INTO eav_table VALUES(3, "Prop2", 3);
INSERT INTO eav_table VALUES(4, "Prop1", 1);
INSERT INTO eav_table VALUES(4, "Prop3", 1);

Q1 Result

2    1
3    1
4    1

Q2 Result

3    1

Q3 Result

2    1
4    1

The ones are the counts I generated; a more polished rendition would remove them.

Pass 2

If it can be managed, a nicer final query would join a table that lists all the product IDs which have at least one matching attribute/value pair in common with product ID 1 with a table that lists all the product IDs which have zero disagreements with product ID 1.

List of products with one or more attributes identical to product 1

The first query is the same as the first query in Pass 1, except we'll drop the count in the result set.

SELECT a.ProductID
  FROM EAV_Table AS a
  JOIN EAV_Table AS b
    ON a.Attribute = b.Attribute AND a.value = b.value
 WHERE a.ProductID != 1
   AND b.ProductID  = 1
 GROUP BY a.ProductID

Either the GROUP BY clause or a DISTINCT in the select list is necessary, in general (though the sample data doesn't formally require it).

List of products with zero attributes that mismatch product 1

We will exploit the fact that COUNT(column) only counts non-null values, and use a LEFT OUTER JOIN.

SELECT c.ProductID
  FROM      EAV_Table AS c
  LEFT JOIN EAV_Table AS d
    ON c.Attribute = d.Attribute
   AND c.Value != d.Value
   AND c.ProductID != 1
   AND d.ProductID  = 1
 GROUP BY c.ProductID
HAVING COUNT(d.Value) == 0;

Note that the WHERE clause has been merged into the ON clause; this is actually rather important.

Result — Pass 2

We build the two queries above as sub-queries that are joined to generate the final result:

SELECT f.ProductID
  FROM (SELECT a.ProductID
          FROM EAV_Table AS a
          JOIN EAV_Table AS b
            ON a.Attribute = b.Attribute AND a.value = b.value
         WHERE a.ProductID != 1
           AND b.ProductID  = 1
         GROUP BY a.ProductID
       ) AS e
  JOIN (SELECT c.ProductID
          FROM      EAV_Table AS c
          LEFT JOIN EAV_Table AS d
            ON c.Attribute = d.Attribute
           AND c.Value != d.Value
           AND c.ProductID != 1
           AND d.ProductID  = 1
         GROUP BY c.ProductID
        HAVING COUNT(D.Value) = 0
       ) AS f
    ON e.ProductID = f.ProductID

This produces the answer 2 and 4 on the sample data.

Note that one part of this exercise is learning not to be satisfied with the first answer you develop. Note that it would be good to benchmark the solutions on full size data sets instead of test data sets with only 7 rows in the table.

Beautifull! It looks similar to the query I was starting to build, but with the remarcable difference that yours actually works! — Radu094, Dec 16 '12 at 21:24

Kaf · Answer 2 · 2012-12-16T19:15:06.550

1

If I understand your question correctly I think this should do the job. Here is the fiddle;

DECLARE @pId INT = 1

SELECT A.pid
FROM (
        SELECT pid, count(*) total
        FROM t
        WHERE pid <> @pId
        GROUP BY pid 
     ) A JOIN 
     (
        SELECT pid, count(*) matches
        FROM t 
        WHERE pid<>@pId and att + ':' + convert(varchar(12), val) in (
             SELECT att + ':' + convert(varchar(12), val) FROM t
             WHERE pid=@pId)
        GROUP BY pid
     ) B ON A.pid = B.pid

WHERE total = matches

NOTE: Edited as per comments with additional data

edited Dec 16 '12 at 19:15

answered Dec 16 '12 at 18:37

Kaf

33,101
7
58
78

That won't reject 3, will it? – Jonathan Leffler Dec 16 '12 at 18:39
@JonathanLeffler check out fiddle, it does. answer is `2` – Kaf Dec 16 '12 at 18:40
It will reject 3, but only cause' it has two props defined, not because those props will not match..try to set prop2 of product 3 to '2' and it will still (wrongfully) reject it :-( – Radu094 Dec 16 '12 at 18:44
@Radu094 that means checking the values as well which you did not mention in the OP did you? Not sure – Kaf Dec 16 '12 at 18:48
@Radu094, check out now, bit ugly though. – Kaf Dec 16 '12 at 19:03
Close, but not there yet. I've expanded on your query and it seems to work OK here http://sqlfiddle.com/#!3/f0cd9/4 Brilliant idea to concatenate the prop+val. Any ideea what performance penalty I can expect on this (assuming 10000's of products, etc) ? – Radu094 Dec 16 '12 at 20:36
@Radu094, I think your expansion contradicts with your question `2) have no attributes different to the product`. Please check the data (I had added some). `ProductId 3 should not be selected` as it has `(3,'prop3',3)` which is different from the list of `prop, val` pairs of productId 1. – Kaf Dec 16 '12 at 21:27
yeah, sorry'bout that, I explained it poorly: 2) should be "have no attributes that my product has that are of different values "(ie. all the attributes they do have in common must have equal values) – Radu094 Dec 17 '12 at 10:47

wildplasser · Answer 3 · 2012-12-16T23:10:19.193

For completeness, using a CTE. (Note: this will find all the twins, not only those for productId = 1)

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE eav
        ( zentity INTEGER NOT NULL
        , zattribute varchar NOT NULL
        , zvalue INTEGER
        , PRIMARY KEY (zentity,zattribute)
        );
INSERT INTO eav(zentity, zattribute, zvalue) VALUES
 (1, 'Prop1',1) ,(1, 'Prop2',2)
,(2, 'Prop1',1)
,(3, 'Prop1',1) ,(3, 'Prop2',3)
,(4, 'Prop1',1) ,(4, 'Prop3',3) -- added by Jonathan L.
        ;

        -- CTE: pair of entities that have an 
        -- {attribute,value} in common
WITH pair AS (
        SELECT a.zentity AS one
                , b.zentity AS two
                , a. zattribute AS att
        FROM eav a
        JOIN eav b ON a.zentity <> b.zentity  -- tie-breaker
                AND a.zattribute = b.zattribute
                AND a.zvalue = b.zvalue
        )
SELECT pp.one, pp.two, pp.att
FROM pair pp
        -- The Other entity (two) may not have extra attributes
        -- NOTE: this NOT EXISTS could be repeated for pair.one, to also
        -- suppress the one.* products that have an extra attribute
WHERE NOT EXISTS (
        SELECT * FROM eav nx
        WHERE nx.zentity = pp.two
        AND nx.zattribute <> pp.att
        )
ORDER BY pp.one, pp.two, pp.att
        ;

BTW: the real underlying problem is "relational division". Maybe a newer SQL standard should introduce an operator for it?

Find similar entities in EAV model

3 Answers3

Pass 1

List of products with one or more attributes identical to product 1

List of products with one or more attributes mismatching product 1

Result — Pass 1

Test Data

Q1 Result

Q2 Result

Q3 Result

Pass 2

List of products with one or more attributes identical to product 1

List of products with zero attributes that mismatch product 1

Result — Pass 2