
Suppose a database table stores some file metadata. Each file can be identified by a globally unique file ID. Each file is located within a folder (that can store multiple files) which also has a globally unique ID. So the table has, among other columns, two unique identifiers:

FileID (GUID/uniqueidentifier)
FolderID (GUID/uniqueidentifier)

Note that each FileID in the table is supposed to be different (assigned a random GUID) while the same FolderID can appear multiple times. To obtain a specific file record, only FileID can be used:

SELECT * FROM table WHERE FileID=...

My main question is: is there any performance benefit in explicitly specifying FolderID together with FileID to limit the number of records to search? Like this:

SELECT * FROM table WHERE FileID=... AND FolderID=...

Which way should be used, the first one, the second one, does it matter at all? Does it depend on certain conditions such as indexing, field cardinality etc.? How smart is SQL Server when it comes to optimizing queries like this? Is the order of conditions relevant (i.e. WHERE FileID=... AND FolderID=... vs WHERE FolderID=... AND FileID=...)? The only potential benefit of specifying FolderID superficially seems to be some protection against the highly unlikely FileID GUID collision.
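One way to answer this empirically is to run both forms side by side and compare the execution plans and I/O statistics. A minimal sketch (the table name `dbo.Files` and the declared variables are placeholders for illustration):

```sql
-- Hypothetical table and IDs; substitute your own.
DECLARE @FileID uniqueidentifier = NEWID();
DECLARE @FolderID uniqueidentifier = NEWID();

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Form 1: FileID only.
SELECT * FROM dbo.Files WHERE FileID = @FileID;

-- Form 2: FileID plus the (logically redundant) FolderID predicate.
SELECT * FROM dbo.Files WHERE FileID = @FileID AND FolderID = @FolderID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If both statements report the same logical reads and the same plan shape (an index seek on the FileID index), the extra predicate makes no measurable difference.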


My initial speculation (not knowing how queries are executed internally) was like this: if we ignore block size and assume both fields are indexed (assuming B-trees or any such logN structures), then in the first case (using only FileID) the search time complexity when X files are stored will be: log2(X)

If X files are distributed uniformly among d folders, each folder contains f = X/d files and the search complexity becomes: log2(d) + log2(f) = log2(d*f) = log2(X) - no difference (plus some potential overhead in real life, but it doesn't change the complexity itself). This assumes that FolderID is searched first, and then the subset of FileIDs within that folder. If neither field is indexed, there is no obvious difference either.

However, suppose only FileID is indexed while FolderID is not (so a linear search with N/2 average comparisons applies to folders) - now if we use the second form of the query and the folder condition is evaluated first, the search complexity becomes d/2 + log2(f), which can be substantially worse than using only FileID with log2(X) - e.g. when X = 1 million files are distributed in d = 50,000 folders, meaning f = 20 files per folder.
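Plugging the example numbers into the two estimates makes the gap concrete. A quick sketch (T-SQL `LOG()` is the natural logarithm, so log2 is computed as `LOG(x) / LOG(2)`; the numbers are the hypothetical ones from above):

```sql
-- Rough comparison of the two cost estimates for X = 1,000,000 files
-- spread over d = 50,000 folders (f = X/d = 20 files per folder).
DECLARE @X float = 1000000, @d float = 50000, @f float = 20;

SELECT LOG(@X) / LOG(2)          AS SeekOnFileIDOnly,     -- log2(X), roughly 20 comparisons
       @d / 2 + LOG(@f) / LOG(2) AS ScanFoldersThenSeek;  -- d/2 + log2(f), roughly 25,004
```

So if the engine really did evaluate the unindexed FolderID predicate first, the second form would be about three orders of magnitude more work - which is exactly why the optimizer's choice of access path matters more than the textual order of the conditions.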

Will SQL Server detect things like this and act accordingly?

w128
    See [Sql statement processing](http://msdn.microsoft.com/en-us/library/ms190623%28v=sql.105%29.aspx) and [How to obtain execution plan](http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan). – Nikola Markovinović Jun 27 '13 at 14:31

1 Answer


You are missing the power of the Index Seek (see Scans vs. Seeks).

If you want to optimize performance of a FolderID, FileID select, then make (FolderID, FileID) the clustered PK (in that order) and specify both in the WHERE clause.
You will get an index seek.

Or just make FileID the PK and search on FileID alone.
You will again get an index seek.

If FileID is the PK, then you would need to add an index on FolderID if you want to speed up searches on FolderID alone.
That FolderID index will take space.
A clustered index takes no additional space, but you only get one per table.

An index seek is very fast.
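The two strategies above can be sketched as DDL (table and constraint names are hypothetical; adapt them to your schema):

```sql
-- Option 1: composite clustered PK on (FolderID, FileID), in that order.
-- Seeking requires FolderID (or FolderID + FileID) in the WHERE clause.
CREATE TABLE dbo.Files
(
    FolderID uniqueidentifier NOT NULL,
    FileID   uniqueidentifier NOT NULL,
    Name     nvarchar(260)    NOT NULL,
    CONSTRAINT PK_Files PRIMARY KEY CLUSTERED (FolderID, FileID)
);

-- Option 2: FileID alone as the clustered PK, plus a nonclustered
-- index on FolderID to support folder-only searches.
CREATE TABLE dbo.Files2
(
    FileID   uniqueidentifier NOT NULL,
    FolderID uniqueidentifier NOT NULL,
    Name     nvarchar(260)    NOT NULL,
    CONSTRAINT PK_Files2 PRIMARY KEY CLUSTERED (FileID)
);

CREATE NONCLUSTERED INDEX IX_Files2_FolderID ON dbo.Files2 (FolderID);
```

Note that with random GUID keys, a clustered index on the GUID column can cause page splits and fragmentation on insert; that is a separate trade-off from the seek-vs-scan question discussed here.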

paparazzo