DISCLAIMER: This is not a general question about the problems of NOLOCK (and therefore not a duplicate of Is the NOLOCK (Sql Server hint) bad practice?); it is a specific question about how NOLOCK and DISTINCT interact in an attempt to better understand SQL Server's inner workings.
As strange as it may seem, it appears to me that NOLOCK may be causing DISTINCT to fail in a certain case. Here is the example:
INSERT INTO TableA (ID)
SELECT DISTINCT ID
FROM TableB WITH (NOLOCK)
The above example occasionally produces a PK violation. Here are the other relevant facts:
- PK of TableA is ID
- PK of TableB is ID
- TableA is empty when this starts.
- Nothing else is writing to TableA during this time.
- There are updates happening to TableB while the above is running.
My working theory is that 1) the updates on TableB combined with the use of NOLOCK are causing duplicate data, and 2) the optimizer is relying on the fact that TableA has a PK on the same column that we are DISTINCTing, and so doesn't actively perform a DISTINCT operation on the rows that are being returned, it just assumes the rows will already be distinct.
Can anyone confirm this? And if so, is this by design, or a bug in SQL Server?
I originally thought even with dirty reads and the possibility of duplicate rows that DISTINCT would be a guarantee to clean up the duplicates, but the evidence I'm seeing seems to indicate otherwise.
This error was seen on SQL Server 2008R2.