In Azure Data Warehouse, it is recommended to use HASH for distributed tables on the fields you plan to JOIN and potentially even aggregate on. It's also recommended that you create statistics on the fields you plan to use.
Let's say you have two tables with the same amount of records and fields. One table is hashed on a very unique key where the other is ROUND_ROBIN, where data is randomly divided evenly across the 60 databases.
-- CustomerID is alphanumeric
SELECT
[ProductID]
,COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable]
GROUP BY
[Product]
On the hashed table, if you aggregate on the hashed key, you could see it's returning the results within 0.05 seconds. On the round robin table, with the same aggregation, it's 0.51 seconds.
CREATE STATISTICS [ProductID] ON [dbo].[FactTable] ([ProductID]);
CREATE STATISTICS [CustomerID] ON [dbo].[FactTable] ([CustomerID]);
When you apply statistics to the fields you are aggregating, the hashed table still returns within 0.05 seconds. No change. When you do the same to the round robin table, the results return to same as the hashed within 0.05 seconds.
Note: 2000 DWU running query in xlarge resource (max memory allocation)
When examining the distribution of the hashed key, I found 140 million records are stored in one distribution of the 60 databases. 50 million other records are distributed pretty evenly across the other 59 databases. A default value is making my hashed key not a great candidate for a hash.
My question is, while the round robin table with statistics is performing great on the same field I am hashing on with the other table, will the same performance stick when I use the round robin on JOINS with that key? I haven't tested it fully yet, but looking for the best approach.
Does statistics help on joins? Some of the articles I read say they do, but seeing if anyone else has more solid answers on the approach when considering round robin over a distribution key. I don't really have any good candidates that don't cause data skew like in my above example with the 140 million.