
I currently have a query that contains a self-join to query all the direct and indirect managers of an employee from a table that stores company organization info using the nested sets model. In this SQL notation, numbers preceded by a colon (e.g. :1) are variables:

select parent.empid, parent.depth
from RelationshipMgr as node
join RelationshipMgr as parent
  on node.lft between parent.lft and parent.rgt
  and node.empid = :1
order by parent.lft

I can trivially return only the ID of a manager n levels above the employee by adding parent.depth = node.depth - :2 to either the join condition or a where clause (side question: which is faster?).
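
For example, with the level filter in a where clause (the same query sketched out; :2 is how many levels above the employee the manager sits):

select parent.empid
from RelationshipMgr as node
join RelationshipMgr as parent
  on node.lft between parent.lft and parent.rgt
  and node.empid = :1
where parent.depth = node.depth - :2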

The problem: I'm trying to turn this query into a view, and I'm not having much luck, because most or all of my variables live in the join condition. My current best plan is to break those parts out into columns that I can then filter on with a where clause when I query the view, for example:

select node.EmpID, parent.empid as MgrID, parent.depth as MgrDepth,
       node.depth - parent.depth as MgrRelativeAltitude
from RelationshipMgr as node
join RelationshipMgr as parent
  on node.lft between parent.lft and parent.rgt
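
Wrapped up, that would look roughly like this (the view name is just a placeholder for this sketch), with the former variables becoming where-clause filters at query time:

create view vwManagers as
select node.EmpID, parent.empid as MgrID, parent.depth as MgrDepth,
       node.depth - parent.depth as MgrRelativeAltitude
from RelationshipMgr as node
join RelationshipMgr as parent
  on node.lft between parent.lft and parent.rgt

-- querying it for the manager :2 levels above employee :1
select MgrID from vwManagers where EmpID = :1 and MgrRelativeAltitude = :2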

You can see I've had to invent the MgrRelativeAltitude column to be able to find the ID of a manager n levels above the employee, but that is hardly the biggest issue. I worry that this will perform badly, since SQL Server appears to do the full join as specified by the join conditions and then filter it by the where clause, rather than intelligently using the where clause to limit the join. Is there a better way to create the view? Should I leave this as a query and forget about making a view? Would I gain anything by making it a stored procedure instead of a view?

And please don't say "premature optimization is evil"... it's not premature. The implementation I'm replacing used something like a bastardized adjacency list that had a record relating an employee to every one of his direct and indirect managers... worst case O(n^2) records, and it predictably ran into serious performance problems once we had more than about 300,000 employees in the hierarchy. My new nested-sets implementation will alleviate those performance problems, except for this one query... if you do a select * on the proposed view, the results will be nearly identical to the old table I'm trying to replace, and that concerns me very much.

rmeador
  • Without seeing the tables I am not sure how the data is structured (not sure what lft/rgt are). If you are on a relatively new SQL Server, I would take a look at CTEs for handling your query; you can quite often make something complex a lot easier to read. This page on MSSQL Tips uses a similar example which may be of help: http://www.mssqltips.com/tip.asp?tip=1520 – u07ch Aug 04 '10 at 09:06
  • @u07ch it's a garden variety nested-sets model... there really isn't anything special about my table structure. "lft" and "rgt" are the left and right columns (occasionally also known as down and up, respectively) of the nested sets technique. Those names are fairly standard since "left" and "right" are reserved words in SQL. I provided a link in my question, but here is another (scroll down to the section on nested sets): http://dev.mysql.com/tech-resources/articles/hierarchical-data.html – rmeador Aug 04 '10 at 14:03

1 Answer


You are trying to determine the hierarchical relationship of non-adjacent nodes. As you've found, this is a relatively expensive run-time calculation, whether in a view or a regular query. If it's run frequently, I'd suggest instead creating what's known as a bridge table - either as a real table updated via trigger or as an indexed view in SQL Server 2005+ (albeit I haven't tried the indexed view approach). That noted, nested sets provide superior read times compared to an adjacency list anyway.

The trade-off is a table with significantly more rows than the source, because it effectively represents the relationships between all nodes; this slows down writes, since the bridge must be updated any time nodes are added, deleted, or parents change. In return you can index it and achieve fast retrieval times. If updating the bridge proves a bottleneck, one optimization is to access it through a stored procedure, with the bridge serving as a cache for frequently run combinations of inputs while infrequent cases are calculated at run time. You'll need to evaluate the read vs. write frequency of the underlying node table to make that determination.
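
As a rough sketch (the table name, column names, and types here are only illustrative, and it assumes the RelationshipMgr columns from your question), the bridge might be built and indexed like this:

create table MgrBridge (
    EmpID int not null,
    MgrID int not null,
    MgrRelativeAltitude int not null,
    primary key (EmpID, MgrID)
)

-- populate from the nested-sets table; like your view, this includes each
-- employee paired with itself at altitude 0
insert into MgrBridge (EmpID, MgrID, MgrRelativeAltitude)
select node.empid, parent.empid, node.depth - parent.depth
from RelationshipMgr as node
join RelationshipMgr as parent
  on node.lft between parent.lft and parent.rgt

create index IX_MgrBridge_EmpAltitude on MgrBridge (EmpID, MgrRelativeAltitude)

-- lookups then become index seeks, e.g. the manager :2 levels above employee :1
select MgrID from MgrBridge where EmpID = :1 and MgrRelativeAltitude = :2

Keeping the bridge current via triggers on RelationshipMgr (or by rebuilding it on change) is omitted here; that maintenance is exactly the write cost described above.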

An overview of options for representing hierarchical data in an RDBMS is available here.

orangepips