Suppose a database table stores some file metadata. Each file can be identified by a globally unique file ID. Each file is located within a folder (that can store multiple files) which also has a globally unique ID. So the table has, among other columns, two unique identifiers:
FileID (GUID/uniqueidentifier)
FolderID (GUID/uniqueidentifier)
Note that each FileID
in the table is supposed to be different (assigned a random GUID) while the same FolderID
can appear multiple times. To obtain a specific file record, only FileID
can be used:
SELECT * FROM table WHERE FileID=...
My main question is: is there any performance benefit in explicitly specifying FolderID
together with FileID
to limit the number of records to search? Like this:
SELECT * FROM table WHERE FileID=... AND FolderID=...
Which way should be used, the first one, the second one, does it matter at all? Does it depend on certain conditions such as indexing, field cardinality etc.? How smart is SQL Server when it comes to optimizing queries like this? Is the order of conditions relevant (i.e. WHERE FileID=... AND FolderID=...
vs WHERE FolderID=... AND FileID=...
)? The only potential benefit of specifying FolderID
superficially seems to be some protection against the highly unlikely FileID
GUID collision.
My initial speculation (not knowing how queries are executed internally) was like this: if we ignore block size and assume both fields are indexed (assuming B-trees or any such logN structures), then in the first case (using only FileID
) the search time complexity when X files are stored will be: log2(X)
If X files are distributed uniformly among d folders, each folder will contain f files and the search complexity becomes: log2(d) + log2(f) = log2(d*f) = log(X)
- no difference (plus some potential overhead in real-life, but it doesn't impact the complexity itself). This is assuming that FolderIDs
are searched first, and then a subset of FileIDs
. If neither field is indexed there is no obvious difference either.
However, suppose only FileID
is indexed while FolderID
is not (linear search with N/2 average complexity applies) - now if we use the second form for the query, the search complexity becomes d/2 + log2(f)
which can be substantially worse than using only FiledID
with log2(X)
- e.g. when X=1 million files are distributed in d=50000 folders - meaning f=20 files per folder.
Will SQL Server detect things like this and act accordingly?