4

After 20 years of professional development, I still find myself feeling completely clueless when it comes to certain aspects of database performance. This is one of those times. There are thousands of questions here and elsewhere regarding table and index fragmentation and its effect on performance. I know the basic do's and don'ts, but there are times when it seems there is no "good" answer. Here's my problem, and I run across it fairly often:

A table is used only to store id pairs that define one-to-many relationships, let's use an example of friends. A Friends table contains only personId (int), friendId (int). Each pair is of course unique. (And consequently, but probably irrelevant to the question, the inverse pair to each relationship is also present.) So a very small sample of the data would be:

1001, 1011
1001, 1012
1001, 1013
1011, 1001
1012, 1001
1013, 1001
etc...

Person 1001 has 3 friends, and of course each of those friends has person 1001 as a friend, etc. This table would potentially have millions, even as much as hundreds of millions, of relationships (rows), any given person would be likely to have in the neighborhood of hundreds of friends. And they would be inserted and updated (actually in this case some existing ones deleted, new ones added, no actual row updates) frequently, and in no specific order. For any given batch of inserts, they could be ordered by personId, friendId, but beyond that, inserts would largely be unordered over time.

The usage then of this table would be to query all friends of a given person, or inner join to a query of people to group and aggregate additional data relating to each person's friends, etc., the typical uses you would expect for a table of one-to-many relationships. Query performance would probably be more critical than insert performance, but both would be important, as both would be occurring frequently. Example query:

SELECT p.Name FROM Friends f
INNER JOIN People p ON f.friendId = p.id
WHERE f.personId = @personId

In the past, without even thinking twice, I would give the table a composite primary key of personId, friendId, which in SQL Server would by default be created as a clustered index, and be done with it. But I've never dealt with such large and performance-critical data as this before, so I'm questioning this decision. I don't see any way in which a table like this could be structured in a way that didn't result in significant and frequent fragmentation. My questions are:

  1. Is there a better way to structure this data?

  2. Considering that the two int columns of the clustered index represent the only data in the table, is fragmentation likely to be as bad as I am assuming, and if so, will the fragmentation under these conditions result in as significant a performance hit as I'm assuming?

(Unless there is some radically different concept in RDBMS I'm not familiar with, I'm assuming the answer to the first question is no. So it's mainly the second question I'm hoping someone has a good base of experience to answer from. BTW, the database is SQL Azure if that makes a difference.)

Thanks, to those DBA gurus among you who have some insight to offer!

reads0520
  • 666
  • 8
  • 17
  • You need to maintain an index, whether it's clustered or not. Presumably the index would need to include both columns (or you'll be doing a *lot* of RID lookups), so it might as well be a clustered index. Disclaimer: not a DBA. – Blorgbeard Jan 22 '14 at 03:39
  • Yes, so obviously I need an index, clustered or not. If, alternately, I gave every row a sequential identity id as primary key / clustered index, e.g. **relationId**, since inserts would happen frequently and deletes infrequently the clustered index should stay relatively unfragmented, but also would never be used - I'd need a non-clustered index on **personId, friendId**, which would be used, and it would be the one that would get horribly fragmented, accomplishing nothing, right? – reads0520 Jan 22 '14 at 04:05
  • 1
    You do understand that the clustered index is the table? – Namphibian Jan 22 '14 at 05:18
  • @Namphibian Yes, I understand that the clustered index is in fact the table itself. But good clarification for others. – reads0520 Jan 22 '14 at 06:40
  • @Namphibian Yes, that's exactly what I meant. – Blorgbeard Jan 22 '14 at 09:26

2 Answers2

4

You neen only clustered index including both fields. Index is ordered data whether it is clustered or not. If you make non-clustered index your data will be doubled and every insert operation will need doubled resources because it will insert data both in heap (or row_id clustered index) and non-clustered index. But seek operation will use only non-clustered index because all needed data is included in it.

So make clustered index and be happy :)

Boogier
  • 609
  • 6
  • 24
  • Thanks Boogier. In this case, I would be willing to sacrifice some bytes for performance, but as you've pointed out, I don't think there is anything to be gained. I suspect I'll just have to watch the fragmentation and performance, and correct (rebuild) as necessary. Unless someone else offers some great insights on the likely degree and effects of fragmentation in this scenario, I'll mark this as correct. – reads0520 Jan 22 '14 at 06:45
-1

You can keep re-organizing table (CTAS etc) over a period of time to address fragmentation.

However, more than anything else, i would suggest tuning SQL and wd highly discourage JOINing 'people' with 'friend' as people also seems to be large table in this case.

To get your query perform faster, I would, first, tweak your SQL as:

SELECT f.*, p.NAME FROM 
(
SELECT personId, friendId FROM Friends f
WHERE f.personId = @personId
) f
, People p ON f.friendId = p.ID

Give it a try and see...

pahariayogi
  • 1,073
  • 1
  • 7
  • 18
  • So.. you're joining from people to a subquery on friends, instead of directly? I don't see this being faster.. – Blorgbeard Jan 22 '14 at 09:28
  • @Blorgbeard Its has made a a difference for me a no of similar occasions in Oracle. Not sure abt SQL Server/Azure. But, i believe it should make a difference. The Sub-query should run first pulling a hand full of friends for a person. This resultset can then be joined with 'people' table to fetch NAME. – pahariayogi Jan 22 '14 at 10:26
  • Any execution plan I would have seen in the past several years generated by SQL Server would apply the filter (WHERE) prior to looking up the rows in the People table, and would perform very well. Some people have preferences of using or not using the JOIN syntax, but I think the performance would not differ. May even generate identical execution plans. – reads0520 Jan 22 '14 at 22:35
  • @reads0520 can u share execution plan? also, check out the folloing which suggest this sub-query should be given preference by SQL Server:- http://stackoverflow.com/questions/2263186/in-which-sequence-are-queries-and-sub-queries-executed-by-the-sql-engine – pahariayogi Jan 24 '14 at 08:34
  • @pahariayogi I would say that post makes my point exactly, when he states that it will decide on the fly what it thinks makes the most sense. And in a simple case like my example query above, it will not choose to join a million rows and then filter it down to a few, it will filter and then join on the result. – reads0520 Jan 25 '14 at 16:43