After 20 years of professional development, I still find myself feeling completely clueless when it comes to certain aspects of database performance. This is one of those times. There are thousands of questions here and elsewhere regarding table and index fragmentation and its effect on performance. I know the basic do's and don'ts, but there are times when it seems there is no "good" answer. Here's my problem, and I run across it fairly often:
A table is used only to store id pairs that define one-to-many relationships, let's use an example of friends. A Friends table contains only personId (int), friendId (int). Each pair is of course unique. (And consequently, but probably irrelevant to the question, the inverse pair to each relationship is also present.) So a very small sample of the data would be:
1001, 1011
1001, 1012
1001, 1013
1011, 1001
1012, 1001
1013, 1001
etc...
Person 1001 has 3 friends, and of course each of those friends has person 1001 as a friend, etc. This table would potentially have millions, even as much as hundreds of millions, of relationships (rows), any given person would be likely to have in the neighborhood of hundreds of friends. And they would be inserted and updated (actually in this case some existing ones deleted, new ones added, no actual row updates) frequently, and in no specific order. For any given batch of inserts, they could be ordered by personId, friendId, but beyond that, inserts would largely be unordered over time.
The usage then of this table would be to query all friends of a given person, or inner join to a query of people to group and aggregate additional data relating to each person's friends, etc., the typical uses you would expect for a table of one-to-many relationships. Query performance would probably be more critical than insert performance, but both would be important, as both would be occurring frequently. Example query:
SELECT p.Name FROM Friends f
INNER JOIN People p ON f.friendId = p.id
WHERE f.personId = @personId
In the past, without even thinking twice, I would give the table a composite primary key of personId, friendId, which in SQL Server would by default be created as a clustered index, and be done with it. But I've never dealt with such large and performance-critical data as this before, so I'm questioning this decision. I don't see any way in which a table like this could be structured in a way that didn't result in significant and frequent fragmentation. My questions are:
Is there a better way to structure this data?
Considering that the two int columns of the clustered index represent the only data in the table, is fragmentation likely to be as bad as I am assuming, and if so, will the fragmentation under these conditions result in as significant a performance hit as I'm assuming?
(Unless there is some radically different concept in RDBMS I'm not familiar with, I'm assuming the answer to the first question is no. So it's mainly the second question I'm hoping someone has a good base of experience to answer from. BTW, the database is SQL Azure if that makes a difference.)
Thanks, to those DBA gurus among you who have some insight to offer!