
In Azure SQL Data Warehouse, it is recommended to distribute tables with HASH on the columns you plan to JOIN on, and potentially even aggregate on. It's also recommended that you create statistics on the columns you plan to use.
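
For illustration, here's a rough sketch of that pattern (the table, schema and column names are hypothetical, not taken from my actual tables):

-- Distribute the fact table on the key it is most often joined / aggregated on
CREATE TABLE [dbo].[FactSales]
WITH (
    DISTRIBUTION = HASH ( [CustomerID] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [ProductID], [CustomerID], [SalesAmount]
FROM [stg].[Sales];

-- Statistics are not created automatically here, so create them on join / GROUP BY columns
CREATE STATISTICS st_FactSales_CustomerID ON [dbo].[FactSales] ( [CustomerID] );
CREATE STATISTICS st_FactSales_ProductID  ON [dbo].[FactSales] ( [ProductID] );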

Let's say you have two tables with the same number of records and columns. One table is hashed on a highly unique key, while the other is ROUND_ROBIN, where data is divided evenly, effectively at random, across the 60 distributions.

-- CustomerID is alphanumeric
SELECT
   [ProductID]
  ,COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable]
GROUP BY
   [ProductID]

On the hashed table, if you aggregate on the hashed key, you can see it returns the results within 0.05 seconds. On the round-robin table, the same aggregation takes 0.51 seconds.

CREATE STATISTICS [ProductID] ON [dbo].[FactTable] ([ProductID]);
CREATE STATISTICS [CustomerID] ON [dbo].[FactTable] ([CustomerID]);

When you create statistics on the columns you are aggregating, the hashed table still returns within 0.05 seconds; no change. When you do the same on the round-robin table, the results come back in the same ~0.05 seconds as the hashed table.

Note: DWU2000, running the query in the xlargerc resource class (maximum memory allocation).
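
(As a side note, and purely as a sketch: the xlargerc resource class is granted by adding the user to the corresponding database role; 'loader_user' below is a hypothetical login.)

EXEC sp_addrolemember 'xlargerc', 'loader_user';      -- grant the xlargerc resource class
-- EXEC sp_droprolemember 'xlargerc', 'loader_user';  -- drop back to the default smallrc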

When examining the distribution of the hashed key, I found 140 million records stored in a single one of the 60 distributions, while the other 50 million records are spread fairly evenly across the remaining 59. A default value is making my hashed key a poor candidate for hashing.
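
(For reference, a quick way to see this per-distribution skew is the space-used DBCC:)

-- Row counts and space per distribution; the skewed distribution stands out
DBCC PDW_SHOWSPACEUSED('dbo.FactTable');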

My question is: while the round-robin table with statistics performs great on the same column I'm hashing the other table on, will that performance hold when I use the round-robin table in JOINs on that key? I haven't tested it fully yet, but I'm looking for the best approach.

Do statistics help on joins? Some of the articles I've read say they do, but I'm looking for more solid answers on the approach when considering round robin over a distribution key. I don't really have any good candidates that don't cause data skew, as in my example above with the 140 million rows.
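
For context, the kind of join I'd be testing is sketched below ([dbo].[DimCustomer] is only a placeholder dimension for illustration); prefixing it with EXPLAIN should show whether a SHUFFLE_MOVE appears for the round-robin table.

--EXPLAIN
SELECT
   d.[CustomerName]
  ,COUNT(*) AS [Orders]
FROM [dbo].[FactTable] f
JOIN [dbo].[DimCustomer] d
    ON f.[CustomerID] = d.[CustomerID]
GROUP BY
   d.[CustomerName];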

Fastidious
  • If you want the answers to be more concrete, it would help if you posted the explain plan for all the queries you tested. Put the word `EXPLAIN` in front of the query, run it, then copy the XML into your question. – GregGalloway Apr 15 '17 at 11:58
  • Don't really have one, but included SQL sample for reference. It's just a basic SELECT COUNT(DISTINCT) GROUP BY query. – Fastidious Apr 15 '17 at 12:15
  • can you get one? I would like to see if stats change the explain plan or if performance differences are just related to caching. – GregGalloway Apr 15 '17 at 12:17
  • I'll see if I can. I don't believe it's caching. I ran the query multiple times at the same speeds (0.51) before applying the statistics. It changed instantly. I don't really notice any caching in Azure DW. – Fastidious Apr 15 '17 at 12:25

1 Answer


First of all, in the current version of Azure SQL Data Warehouse you should always create statistics on columns you will be using in joins, GROUP BY, etc. Ignore the timings you got without statistics, as they are irrelevant.
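
As a rough helper (only a sketch against the standard catalog views; adjust the filters to your schemas), something like this lists columns that no statistics object covers yet:

SELECT s.name AS [schema_name], t.name AS [table_name], c.name AS [column_name]
FROM sys.schemas s
JOIN sys.tables t
    ON t.schema_id = s.schema_id
JOIN sys.columns c
    ON c.object_id = t.object_id
LEFT JOIN sys.stats_columns sc
    ON sc.object_id = c.object_id
   AND sc.column_id = c.column_id
WHERE sc.object_id IS NULL   -- column not referenced by any statistics
ORDER BY s.name, t.name, c.name;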

One of the criteria for a good hash key is that it distributes the data evenly. If you don't have a good candidate, then ROUND_ROBIN is the alternative. The thing to bear in mind about ROUND_ROBIN is that joins against it always incur data movement, but sometimes that is fine. It really depends on what your key queries are; my advice would be to optimise for them.

For the examples you have given, the performance is so quick that it's really not worth spending too much time on them. Do you have some more realistic queries?

Azure SQL Data Warehouse does cache, similar to SQL Server, as described here.

I mocked up a test rig using the AdventureWorksDW sample available when you provision a warehouse from the portal, and found the hashed table performed significantly better (despite large skew) for these sample queries, but your mileage may vary:

-- 603,980,000 rows
CREATE TABLE dbo.FactTable_rr
WITH ( 
    DISTRIBUTION = ROUND_ROBIN, 
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT a.ProductKey AS [ProductID], a.CustomerKey AS [CustomerID]
FROM [dbo].[FactInternetSales] a
    CROSS JOIN ( SELECT TOP 1000 1 FROM [dbo].[FactInternetSales] ) b(c)  -- multiply rows x 1,000
    CROSS JOIN ( SELECT TOP 10 1 FROM [dbo].[FactInternetSales] ) c(c)    -- multiply rows x 10
GO


CREATE STATISTICS st_dbo_FactTable_rr_ProductID ON dbo.FactTable_rr ( ProductID ) WITH FULLSCAN;
CREATE STATISTICS st_dbo_FactTable_rr_CustomerID ON dbo.FactTable_rr ( CustomerID ) WITH FULLSCAN;
GO


CREATE TABLE dbo.FactTable_hh
WITH ( 
    DISTRIBUTION = HASH( [ProductID] ),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM FactTable_rr
GO

CREATE STATISTICS st_dbo_FactTable_hh_ProductID ON dbo.FactTable_hh ( ProductID ) WITH FULLSCAN;
CREATE STATISTICS st_dbo_FactTable_hh_CustomerID ON dbo.FactTable_hh ( CustomerID ) WITH FULLSCAN;
GO



-- Find data skew for a distributed table
DBCC PDW_SHOWSPACEUSED('dbo.FactTable_rr');
DBCC PDW_SHOWSPACEUSED('dbo.FactTable_hh');
GO


--EXPLAIN
SELECT
   [ProductID],
   COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable_rr]
GROUP BY [ProductID]
OPTION ( LABEL = 'rr' );


--EXPLAIN
SELECT
   [ProductID],
   COUNT(DISTINCT [CustomerID]) AS [Unique Records]
FROM [dbo].[FactTable_hh]
GROUP BY [ProductID]
OPTION ( LABEL = 'hh' );


-- Summary
SELECT [label], COUNT(*) records, CAST( AVG(total_elapsed_time) / 1000. AS DECIMAL(10,2) ) total_elapsed_time_s
FROM sys.dm_pdw_exec_requests 
WHERE [label] IS NOT NULL
  AND command Like 'select%'
GROUP BY [label];

My results, with a table of 600 million rows:

[screenshot: total elapsed time by query label ('rr' vs 'hh')]

If I look at the EXPLAIN plans for the two queries ('rr', 'hh'), I see a much simpler plan for the hashed table, with no data movement. The 'rr' plan contains a SHUFFLE_MOVE operator, which redistributes the data of a distributed table as part of the query.
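
You can also see the movement at execution time rather than only in the plan; a sketch against the request-steps DMV for the two labelled queries would be:

-- ShuffleMoveOperation steps are the redistributions; the 'hh' query should have none
SELECT r.[label], st.step_index, st.operation_type, st.row_count, st.total_elapsed_time
FROM sys.dm_pdw_exec_requests r
JOIN sys.dm_pdw_request_steps st
    ON st.request_id = r.request_id
WHERE r.[label] IN ('rr', 'hh')
ORDER BY r.[label], st.step_index;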

wBob
  • My problem is that even simple queries hang on one billion rows. Simple queries that do DATEPART and GROUP BY on highly unique columns, in order to avoid doing COUNT(DISTINCT) on those same columns, take a long time. – Fastidious Apr 16 '17 at 13:12
  • That sounds like a completely different scenario to the one you have described above (where performance is sub-second)? If you could provide the DDL, some sample data, the actual queries you are running and the explain plans, then someone will be able to help you. Honestly, the scenario you have just described sounds like a separate question. – wBob Apr 16 '17 at 13:22
  • It's basically the same query, but on a different table with more records. The only difference is one more filter over a subset of data with a billion records, whereas the example query I showed runs against another subset of only 200 million records that I split out into a separate table. Because I split it out, the query returns in seconds rather than many minutes. But yes, it can be another question. Why can't the same query work on bigger data? :) – Fastidious Apr 16 '17 at 13:23
  • Just for context, most of my trouble comes from COUNT(DISTINCT), or from pre-aggregating the data with GROUP BY to get the same results (to avoid using the DISTINCT clause). In both scenarios they take a long time on bigger datasets. So I split out the tables, made one hashed and the other round robin, to see if there is any performance gain from hash over round robin. – Fastidious Apr 16 '17 at 13:28
  • For similar tests, you would have to use a CustomerID that is alphanumeric (50 characters), has over 20 million unique values, and has a default value covering more than 100 million rows. The CustomerID you're using is an INT-based data type, which will aggregate a lot faster. – Fastidious Apr 16 '17 at 13:44
  • I can try and alter my rig to more closely reflect your scenario, but yes, if this query is a key business / reporting requirement then you might consider pre-aggregating for it. Obviously that introduces latency. Regarding "hanging": I notice you are using the xlargerc resource class, which really reduces concurrency. An xlargerc query takes 64 of the 80 concurrency slots available at DWU2000, meaning you can't have more than one xlargerc query executing at the same time. You could have one other mediumrc, or many smallrc, running concurrently. Check for queuing queries in `sys.dm_pdw_exec_requests` (sketched just after these comments). – wBob Apr 16 '17 at 14:01
  • I think the other option would be to replace the alphanumeric key with an int-based data type. That's likely my only option to improve performance. On the question though, it does seem statistics help a great deal on JOINs. – Fastidious Apr 16 '17 at 14:08
  • How did you get on with this? Would you consider upvoting or marking as answer which confirms statistics are important? – wBob Apr 26 '17 at 18:19
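
A sketch of the queue check mentioned in the comments above (assuming queued requests show up with a 'Suspended' status):

-- Requests waiting for a concurrency slot
SELECT request_id, [status], submit_time, start_time, total_elapsed_time, [command]
FROM sys.dm_pdw_exec_requests
WHERE [status] = 'Suspended'
ORDER BY submit_time;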