2

Lets assume I have 3 tables in my Sql Serer 2008 database:

CREATE TABLE [dbo].[Properties](
    [PropertyId] [int] NOT NULL,
    [PropertyName] [nvarchar](50) NOT NULL
)

CREATE TABLE [dbo].[Entities](
    [EntityId] [int] NOT NULL,
    [EntityName] [nvarchar](50) NOT NULL
)    

CREATE TABLE [dbo].[PropertyValues](
    [EntityId] [int] NOT NULL,
    [PropertyId] [int] NOT NULL,
    [PropertyValue] [int] NOT NULL
)
  1. Table Properties contains possible set of Properties which values can set up configured for business objects.
  2. Table Entities contains business objects which are configured from app.
  3. Table 3 contains selected Property values for business objects. Each business object can contain its own set of properties (i.e. "Property1" can be configured for first object but not configured for the second one).

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.

Any suggestions?

[ADDED] For example there is an entry in Entities table with EntityId = 1. In PropertyValues table there are 3 row which are related to this entry:

 EntityId  PropertyId  PropertyValue
 1         4           Val4
 1         5           Val5
 1         6           Val6

The requirement is to find other entries in Entity table which have 3 related rows in PropertyValues table and these rows contain the same data as rows for EntityId = 1 (besides of EntityId column)

[ADDED] Please, see my new question: Best approach to store data which attributes can vary

[BOUNTY1] Thanks for all. The answers were very helpful. My task is complicated a little bit (but this complication can be useful in performance purposes). Please, see the details below:

  • The new table named EntityTypes is added

  • EntityTypeId column has been added into Entities and Properties tables

  • Now, there are several types of entities. Each entity has it's own set of properties.

    Is it possible to increase performance using this information?

[BOUNTY2] There is the second complication:

  • IsDeleted column is added to Property table
  • PropertyValues table can have values for Properties which already deleted from database. Entities which have such properties are considered invalid.
  • Some entities don't have values for each property of EntityType set. These entities also are considered as invalid.

The question is: How do I can write a script which will select all Entities and additional column IsValid for them.

Community
  • 1
  • 1
Egor4eg
  • 2,678
  • 1
  • 22
  • 41
  • 7
    Please don't say "Performance is critical" when you are using nvarchar http://stackoverflow.com/questions/35366/varchar-vs-nvarchar-performance/198753#198753 and an EAV schema http://en.wikipedia.org/wiki/Entity-attribute-value_model – gbn Aug 16 '11 at 10:13
  • 1
    I cannot change my schema. Set of properties is customized by site administrators and it can be easily changed. Actually, it is not my real database. It is just a simple example. The real database more complex. Nvarchar is required, because I have to support multilanguage users input. – Egor4eg Aug 16 '11 at 10:43
  • Could you just clarify - the requirement is that you have some Object in your **application** which was created using values in the database? But you want to find the corresponding record in the database which created the business object? – Arj Aug 16 '11 at 11:19
  • @a12jun: I added an example. Sorry for the ambiguity. – Egor4eg Aug 16 '11 at 11:31
  • Nice example, makes a lot more sense now! – Arj Aug 16 '11 at 11:44
  • @gbn: It would be very nice to use a more effective approach of storing of the data in my database. I am opened for discussion. Could you give me an advice how I could store it? Please keep in mind that set of properties must be customizable. Thanks. – Egor4eg Aug 16 '11 at 11:57
  • Can an entity have duplicate properties (with either identical or different values)? – Andriy M Aug 16 '11 at 12:04
  • @Andriy M: duplicated properties are not possible – Egor4eg Aug 16 '11 at 12:07
  • 2
    You started with a tricky question, then added complexity with an edit, and then even more complexity with another edit. With the criteria now spread across a long question, I'm finding it hard to fully understand the problem. You should probably have just submitted a second and fully revised question. – Philip Kelley Aug 23 '11 at 14:06
  • If in entity has a row marked deleted then should the entityID in whole be ignored or just that row? Same question for type. – paparazzo Aug 23 '11 at 19:33

5 Answers5

5
;with cteSource as
(
  select PropertyId,
         PropertyValue
  from PropertyValues
  where EntityId = @EntityID
)
select PV.EntityId
from PropertyValues as PV
  inner join cteSource as S  
    on PV.PropertyId = S.PropertyId and
       PV.PropertyValue = S.PropertyValue and
       PV.EntityId <> @EntityID
group by PV.EntityId
having count(*) = (select count(*)
                   from cteSource) and
       count(*) = (select count(*)
                   from PropertyValues as PV1
                   where PV1.EntityId = PV.EntityId)

For your addition you can add this where clause:

where -- exlude entities with deleted properties
      PV.EntityID not in (select PV2.EntityID
                          from Properties as P
                            inner join PropertyValues as PV2
                              on P.PropertyID = PV2.PropertyID
                          where P.IsDeleted = 1)
      -- exclude entities with missing EntityType                     
  and PV.EntityID not in (select E.EntityID
                          from Entities as E
                          where E.EntityType is null) 

Edit:

If you want to test the query against some sample data you can do so here: https://data.stackexchange.com/stackoverflow/q/110243/matching-properties

Community
  • 1
  • 1
Mikael Eriksson
  • 136,425
  • 22
  • 210
  • 281
  • This is the answer for the first part of the question. Please, see addition questions. – Egor4eg Aug 22 '11 at 15:49
  • I think you need to make it a left outer join, for when the "other" entity(s) have more properties than the "base" (cte) entity. Then one of your `having` counts has to factor in something like `sum(case when LOJ got null then 0 else 1 end) = sum(cte count)`. [I'd work out and submit it myself, but then I'd have to work out all that subsequent EntityType logic, factor in more joins that can only slow the query down, and end up explaining why it'll probably all turn into table scans anyway, so I'm leaving it as a +1 for you.] – Philip Kelley Aug 23 '11 at 14:44
  • I am afraid this will give you at least one match and count the same. But not require that ALL matches are the same. – paparazzo Aug 23 '11 at 16:09
  • @BalamBalam - I have added a link to where you can test the query. Please let me know if you can find a combination of that does not give the correct result. – Mikael Eriksson Aug 23 '11 at 16:22
  • @Philip Kelley - Not sure what you mean. I have added a link to where you can test the query. Please let me know if find anything I have missed. – Mikael Eriksson Aug 23 '11 at 16:23
  • I agree a match on all values and a match on count will make the list. What would stop a single match and a match on count to also make the list. In the query would not alpha, bravo, delta match alpha, charlie, echo. I may be wrong. That is just how I read the query. I don't have real data to test it. – paparazzo Aug 23 '11 at 16:24
  • @BalamBalam - No. That would be the case with EntityID=7 in this query. http://data.stackexchange.com/stackoverflow/q/110244/matching-properties – Mikael Eriksson Aug 23 '11 at 16:28
  • OK I agree it works. That first count(*) is a of the the matches. But technically the question ask for objects plural and this solution only tests on at a time. Nice work. Would you just create a cursor to iterate through all the entities to get the whole list? – paparazzo Aug 23 '11 at 18:50
  • @BalamBalam - Well... the "task is to find business objects which are exactly same as given object" and this query does that. It finds all EntityID's that is a match to the given parameter `@EntityId`. It is not meant to find **all** duplicates in one go. And I would not use a cursor for this. – Mikael Eriksson Aug 23 '11 at 21:00
  • @MikaelEriksson let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2788/discussion-between-balambalam-and-mikael-eriksson) – paparazzo Aug 23 '11 at 23:27
1

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.

Approach might vary depending on the average number of properties objects will typically have, a few versus dozens.

Assuming that objects have a varying number of properties:

I would start with a composite non-unique index on the dyad (PropertyValues.PropertyId, PropertyValues.PropertyValue) for select-performance.

Then, given an entity ID, I would select its propertid, propertyvalue pairs into a cursor.

[EDIT: Not sure whether (entityid, propertyid) is unique in your system or if you are allowing multiple instances of the same property id for an entity, e.g. FavoriteColors:

              entityid  propertyid property value
                1            17     blue
                1            17     dark blue
                1            17     sky blue
                1            17     ultramarine

You would also need either a non-unique index on the monad (PropertyValues.entityid) or a composite index on (PropertyValues.entityid,PropertyValues.propertyid); the composite index would be unique if you wanted to prevent the same propertyid from being associated with an entity more than once.

If a property can occur multiple times, you should probably have a CanBeMultivalued flag in your Properties table. You should have a unique index on the triad (entityid, propertyid, propertyvalue) if you wanted to prevent this:

        entityid  propertyid property value
                1            17     blue
                1            17     blue

If you have this triad indexed, you would not need (entityid) index or the (entityid, propertyid) composite index in the PropertyValues table.

[/EDIT]

Then I would create a temp table to store matching entity ids.

Then I would iterate my cursor above to grab the given entity's propertyid, propertyvalue pairs, one pair at a time, and issue a select statement with each iteration:

      insert into temp
      select entityid from PropertyValues
      where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue

At the end of the loop, you have a non-distinct set of entityids in your temp table for all entities that had at least one of the properties in common with the given object. But the ones you want must have all properties in common.

Since you know how many properties the given object has, you can do the following to fetch only those entities that have all of the properties in common with the given object:

       select entityid from temp
       group by entityid having count(entityid) =  {the number of properties in the given object}

ADDENDUM:

After the first property-value pair of the given object is used to select all potential matches, your temp table would not be missing any possible matches; rather it would contain entityids that were not perfect matches, which must be discarded in some manner, either by being ignored (by your group by having... clause) or by being explicitly removed from the temp table.

Also, after the first iteration of the loop, you could explore the possibility that an inner join between the temp table and the PropertyValues table might offer some performance gain:

        select entityid from propertvalues
        >> inner join temp on temp.entityid = propertyvalues.entityid <<
        where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue

And you might also try removing entityids from temp after the first iteration:

        delete from temp
        where not exists
        (
         select entityid from propertyvalues
         inner join temp on temp.entityid = propertyvalues.entityid
         where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
        )

Alternatively, it would be possible to optimize this looping approach further if you stored some metadata about property-frequency. Optimally, when looking for matches for a given entity, you'd want to begin with the least frequently occuring property-value pair. You could order the given object's property-value pairs by ascending frequency, so that in your loop you'd be looking for the rarest one first. That would reduce the set of potential matches to its smallest possible size on the first iteration of the loop.

Of course, if temp were empty at any time after the given object's first property-value pair was used to look for matches, you would know that there are no matches for your given object, because you have found a property-value that no other entity possesses, and you could exit the loop and return a null set.

Tim
  • 5,371
  • 3
  • 32
  • 41
  • Objects will have dozens of properties (approximately from 20 to 120). Database will contain thousands of objects. – Egor4eg Aug 16 '11 at 12:23
  • Properties can not occur multiple times – Egor4eg Aug 16 '11 at 13:39
  • I'm not sure that this way can be effective. You propose to perform a separate SELECT statement for each property. So, if entity has 100 properties, Sql Server will have to execute SELECT statement 100 times. – Egor4eg Aug 16 '11 at 13:49
  • @egor4eg: yes, at the top of my answer I offered the caveat that the approach would be better suited to entities with only a few properties, not dozens. In your question, you show only three properties and explicitly mentioned 3. Still, I think it might not be as bad as you think. See the adddendum to my answer. – Tim Aug 17 '11 at 10:53
1

One way to look at this is if I have all base ball cards you have then we don't have the same baseball card as I may have more. But if you also have all the baseball cards that I have then we have exactly the same baseball cards. This is a little more complex as we are looking by team. By team could count the match count, my count, and your count and compare those 3 counts but that is 3 joins. This solution is 2 joins and I think it would be faster than the 3 join option.

To me the bonus questions did not make sense. There as a change to a table but that table name did not match any of the tables. Need a full table description for those bonus questions.

Below is the 2 join option:

    select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
    from
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
        from [PropertyValue] as [PV1] 
        left outer join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
        having count(*) = count([PV2].[EntityId])
    ) as [m1]
    join 
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
        from [PropertyValue] as [PV1] 
        right outer join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
        having count(*) = count([PV1].[EntityId]))
    ) as [m2]
    on [m1].[IDa] = [m2].[IDa] and [m1].[IDb] = [m2].[IDb] 

Below is the 3 join count based option:

    select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
    from
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb], COUNT(*) as [count]
        from [PropertyValue] as [PV1] 
        join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
    )   as [m1]
    join 
    (   select [PV1].[EntityId] as [IDa], COUNT(*) as [count]
        from [PropertyValue] as [PV1] 
        group by [PV1].[EntityId]
        having count(*) = count([PV1].[sID]))
    )   as [m2]
    on [m1].[IDa] = [m2].[IDa] and [m1].[count] = [m2].[count]
    join 
    (   select [PV2].[EntityId] as [IDb], COUNT(*) as [count]
        from [PropertyValue] as [PV2] 
        group by [PV2].[EntityId]
    )   as [m3]
    on [m1].[IDb] = [m3].[IDb] and [m1].[count] = [m3].[count]
paparazzo
  • 44,497
  • 23
  • 105
  • 176
1

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values).

if the "given objec"t is described as e.g. #PropertyValues, so the query would be:

create table #PropertyValues(
    [PropertyId] [int] NOT NULL,
    [PropertyValue] [int] NOT NULL
)

insert #PropertyValues
select
    3, 3 -- e.g.

declare
    @cnt int

select @cnt = count(*) from #PropertyValues

select 
    EntityId
from 
    PropertyValues pv
    left join  #PropertyValues t on t.PropertyId = pv.PropertyId and t.PropertyValue = pv.PropertyValue
group by
    EntityId
having 
    count(t.PropertyId) = @cnt
    and count(pv.PropertyId) = @cnt

drop table #PropertyValues

But if performance is so much critical, you can create special indexed field on table Entities, e.g. EntityIndex varchar(8000), which will be filled by trigger on PropertyValues table as convert(char(10), PropertyId) + convert(char(10), PropertyValue) (for all properties of entity, sorted!). So it will be possible to do very fast seek by this field.

Boogier
  • 609
  • 6
  • 24
  • I like you decision with EntityIndex column! In case if I want to enter data in this column automatically, I need to create a trigger for PropertyValues column, isn't it? In that case, there is no guarantee that data in EntityField and PropertyValues table are valid. They can be different if something will brake in index. Is there a more smart way to do populate EntityIndex with data? – Egor4eg Aug 24 '11 at 09:05
  • If you make a correct trigger - it will work ok. But it's another question already :) – Boogier Aug 24 '11 at 11:04
0

I think this is just a simple self-join:

select P2.EntityID,E.EntityName
from PropertyValues P1
inner join PropertyValues P2
on   P1.PropertyID    = P2.PropertyID
and  P1.PropertyValue = P2.PropertyValue
inner join Entity E
on P2.EntityID = E.EntityID
where P1.EntityId = 1
and   P2.EntityId <> 1
group by P2.EntityID, E.EntityName
Vinny Roe
  • 903
  • 4
  • 8
  • 1
    I think you think wrongly. Your solution will return the entities that have all or some properties in common with the given entity, not necessarily the entities that are *exactly* the same in terms of property sets and property values. – Andriy M Aug 16 '11 at 11:58
  • You're quite right Andriy M. Would this be better? select P2.EntityID,E.EntityName from PropertyValues P1 left join PropertyValues P2 on P1.PropertyID = P2.PropertyID and P1.PropertyValue = P2.PropertyValue inner join Entity E on P2.EntityID = E.EntityID where P1.EntityId = 1 and P2.EntityId <> 1 group by P2.EntityID, E.EntityName having count(P1.EntityID) = Count(P2.EntityID) ? I can't check that this syntax works, but something along these lines. Might be a problem with 2nd inner join. – Vinny Roe Aug 16 '11 at 16:08
  • I'm afraid not. The HAVING clause changes nothing. Both COUNTs will always return the same result. It would be great if the query would be that simple, but in fact it has to be a bit more sophisticated. – Andriy M Aug 16 '11 at 16:16