
OK, first a disclaimer: I'm using an Entity-Attribute-Value approach in a couple of my tables. Basically I have a list of attributes in a single column in one table that I want to populate as a single row in a separate view.
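To illustrate the shape of the data (the attribute names below are made up for the example; the real tables are Client and AttributeCodes, as in the query further down):

-- AttributeCodes:    AttributeCodeId | AttributeName
--                    1               | FirstName
--                    2               | City
--
-- Client (EAV rows): ClientID | AttributeCodeID | Value
--                    100      | 1               | Alice
--                    100      | 2               | Boston
--
-- Desired view row:  ClientID | FirstName | City
--                    100      | Alice     | Boston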

I found this solution and it works great:

SQL: Dynamic view with column names based on column values in source table

However, the initial load was extremely slow (it took over 27 minutes to populate 514 rows). Something didn't seem right at all, so I messed around with selecting portions of the Client table using TOP and got instant results. I found that I could instantly query the entire database this way. However, I ran into a very weird caveat: the most I could select was 5250 records.

Up to that point I was still getting instant results. If I try to select 5251, the query hangs. I tried it on a test server and hit the same limitation, but with a different number (I could select a max of 5321 there). Keep in mind the table has only 514 records, so I have no idea why adding 1 to the TOP value would cause it to hang. Does anyone have any input on this? Here's my working SQL query:

DECLARE @cols AS NVARCHAR(MAX)
DECLARE @query  AS NVARCHAR(MAX)

select @cols = STUFF((SELECT distinct ',' + QUOTENAME(a.AttributeName) 
                from AttributeCodes a                                   
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)') 
    ,1,1,'')


set @query = 'SELECT TOP 5250 ClientID, ' + @cols + ' from 
         (
             select c.ClientID
                    , c.[Value]
                    , a.AttributeName                                           
                from Client c
                inner join AttributeCodes a
                    on a.AttributeCodeId = c.AttributeCodeID

        )x
        pivot 
        (
            min([Value])
            for AttributeName in (' + @cols + ')
        )p'

execute(@query)

EDIT:

OK, it seems as though the problem is that the execution plan is completely changed by that one-row difference in the TOP value. I'll post the two plans below. I still don't know why it would change, or whether there is any way I can prevent it from using a Hash Match instead of an Inner Join.

Execution Plan 1 (instant): [execution plan screenshot]

Execution Plan 2 (30+ minutes): [execution plan screenshot]
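(For anyone wanting to reproduce the comparison, the plans can be captured by running the two TOP variants with plan output enabled, e.g.:)

-- Capture the actual execution plan as XML for each variant.
SET STATISTICS XML ON;
EXECUTE(@query)        -- @query built with TOP 5250
-- ...rebuild @query with TOP 5251 and execute again...
SET STATISTICS XML OFF;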

  • Check the total of the column sizes for the 5251st row. If it is too big I'd expect an error rather than a hang. – HABO Oct 09 '12 at 17:03
  • That's the thing though: there are only 514 rows, so there is no 5251st row. For some reason, once I cross that arbitrary threshold, I go from an instant result to one that takes 10-20 minutes to load. – Daniel Thompson Oct 10 '12 at 04:42
  • How many distinct attribute names are there in the 514 rows? – HABO Oct 10 '12 at 17:37
  • Looks like I have 117 distinct attribute columns...though I'm not seeing a correlation between attribute count & attribute rows. Thanks for your help so far though! – Daniel Thompson Oct 13 '12 at 17:35
  • Please post both execution plans for `TOP 5250` vs `TOP 5251`. Altering the value can change the plan [as alluded to here](http://dba.stackexchange.com/a/24848/3690) – Martin Smith Oct 15 '12 at 06:27
  • Wow, you are absolutely right. Changing that one value completely altered the execution plan! I'll post them in a minute. – Daniel Thompson Oct 17 '12 at 16:04
  • Sorry, I'm having problems posting images in the comments...not quite used to the format here yet... – Daniel Thompson Oct 17 '12 at 16:22

2 Answers


Without looking at the exact index design and knowing exactly the size and scope of what you're selecting and what's in the database, I can't give you the exact reason.

But what I can tell you is that SQL Server uses cost thresholds to decide whether it's more advantageous to scan a table or to perform a large number of seeks. What's happening here is that SQL Server seems to be crossing that threshold at the 5250/5251 boundary.

If you were to add many more data rows to your main table, rebuild the statistics, and then re-query, you would likely find that SQL Server would stop scanning and hashing and would return to the lookup-based plan because the proportion of rows it would have to repeatedly seek would once again be lower.
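As a rough sketch of that experiment (the dbo schema and the FULLSCAN option are assumptions; any statistics refresh would do):

-- After loading the extra rows, refresh statistics on both tables
-- so the optimizer re-costs the seek-vs-scan decision.
UPDATE STATISTICS dbo.Client WITH FULLSCAN;
UPDATE STATISTICS dbo.AttributeCodes WITH FULLSCAN;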

So, how do you eliminate this problem? I'll start out by saying that EAV can be fine for certain types of designs in certain scenarios, but when you're pulling a lot of data from the system, or you want reports off of these models, EAV becomes problematic, and these are the types of problems you tend to see.

You have a couple of possible solutions here.

  1. Add a LOOP JOIN hint to your join (see the sketch after this list). Yuck.
  2. Be more selective in what you're asking from the database -- for example, just ask it for the values for a single client at a time.
  3. Re-evaluate your reasons for using the EAV model and redesign your data model using conventional techniques.
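
As a rough sketch of what options 1 and 2 could look like applied to the dynamic query from the question (this reuses the @cols and @query variables built above; OPTION (LOOP JOIN) is a standard T-SQL query hint, and the @ClientID parameter is purely illustrative):

DECLARE @ClientID INT = 42   -- illustrative: ask for one client at a time (option 2)

set @query = 'SELECT ClientID, ' + @cols + ' from 
         (
             select c.ClientID
                    , c.[Value]
                    , a.AttributeName
                from Client c
                inner join AttributeCodes a
                    on a.AttributeCodeId = c.AttributeCodeID
               where c.ClientID = @ClientID
        )x
        pivot 
        (
            min([Value])
            for AttributeName in (' + @cols + ')
        )p
        OPTION (LOOP JOIN)'   -- option 1: force nested loop joins

exec sp_executesql @query, N'@ClientID int', @ClientID = @ClientID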

Again, I don't know much about your reasoning for this design, so if I were in your position and my back were against the wall, I would probably do a combination of 1) and 3). I'd add the hint to "staunch the bleeding" as it were, and then I'd consider re-evaluating the decision to use an entity-attribute-value design altogether.

Dave Markle
  • OK, thanks for your input. I've definitely thought of scrapping EAV altogether, but I'm not sure how we can mold a conventional approach into our needs. Basically, we have dozens of sites with specific sets of data being collected. A lot of it is overlapping data we'd want to report on across the board, but other parts are site-specific. The data needed often changes based on funding and other aspects, so that's why EAV was so attractive. With a conventional approach we'd constantly be updating columns and forms, but I agree that reporting would be a lot easier. Any suggestions/links are welcome! :) – Daniel Thompson Oct 17 '12 at 19:20

The following is a formatted comment, not an answer.

Out of morbid curiosity, I've been fiddling about with the following test code. I realize that it does not have the proper code for checking for the existence of the test tables, that the edge cases for the random numbers are undoubtedly wrong, ... . So it goes.

The intent is to create considerably more test data and larger results than those described in the question. With the parameters shown it requires roughly 306 seconds to initialize and 87 seconds to run the dynamic query on my notebook. (Windows 7 Professional 64-bit, 16GB memory, SQL Server 2008 R2 64-bit.) I've seen no indications of any difficulties. Daniel, do you see any obvious differences e.g. a different data type for the Value column that might be larger? Version of SQL Server? Available memory? Did I completely muck up your EAV representation?

-- Test parameters.
declare @NumberOfAttributes as Int = 1000
declare @NumberOfClients as Int = 1000
declare @NumberOfSampleRows as Int = 1000000
declare @NumberOfTopRows as Int = 10000
declare @Start as DateTime = GetDate()

-- Houseclean any prior data.
if Object_Id( 'AttributeCodes' ) is not NULL
  drop table AttributeCodes
if Object_Id( 'Client' ) is not NULL
  drop table Client

-- Create the tables.
create table AttributeCodes (
  AttributeCodeId Int Identity(1,1) not NULL,
  AttributeName VarChar(64) not NULL )
create table Client (
  ClientId Int not NULL,
  AttributeCodeId Int not NULL,
  [Value] VarChar(64) not NULL )

set nocount on

-- Generate some sample attributes.
declare @Count as Int
set @Count = @NumberOfAttributes
while ( @Count > 0 )
  begin
  insert into AttributeCodes ( AttributeName ) values
    ( 'Attr_' + Right( '000' + Cast( @Count as VarChar(8) ), 4 ) )
  set @Count = @Count - 1
  end

-- Generate some sample client data.
declare @ClientId as Int
declare @AttributeCodeId as Int
set @Count = @NumberOfSampleRows
while ( @Count > 0 )
  begin
  set @ClientId = 1 + Cast( Rand() * @NumberOfClients as Int )
  set @AttributeCodeId = 1 + Cast( Rand() * @NumberOfAttributes as Int )
  insert into Client ( ClientId, AttributeCodeId, Value ) values
    ( @ClientId, @AttributeCodeId, Replicate( 'i', Cast( Rand() * 64 as Int ) ) )
  set @Count = @Count - 1
  end

-- Build the list of columns.
declare @Cols as NVarChar(Max)
select @Cols = Stuff(
  ( select ',' + QuoteName( AttributeName ) from AttributeCodes order by AttributeName for XML path(''), type).value( '.[1]', 'NVarChar(max)' ), 1, 1, '' );

-- Build and execute the summary query.
declare @Query as NVarChar(Max)
set @Query = 'select top (' + Cast( @NumberOfTopRows as VarChar(8) ) + ') ClientId, ' + @Cols +
  ' from ( select C.ClientId, C.[Value], A.AttributeName from Client as C inner join AttributeCodes as A on A.AttributeCodeId = C.AttributeCodeId ) as X' +
  ' pivot ( Min( [Value] ) for AttributeName in (' + @Cols + ') ) as P' 
declare @InitializationComplete as DateTime = GetDate()
execute( @Query )
select DateDiff( ms, @Start, @InitializationComplete ) as 'Initialization (ms)',
  DateDiff( ms, @InitializationComplete, GetDate() ) as 'Query (ms)',
  DateDiff( mi, @Start, GetDate() ) as 'Total (min)'
HABO