
ORIGINAL PROBLEM

I have created a custom script to retrieve data from a remote SQL Server into our local copy in our office. The script had some issues where selected tables had data inserted twice, creating duplicates. I know that, across all tables in all databases, there should be no duplicates.

This issue has made me paranoid that other tables may have had this problem historically, and therefore I'd like to verify this.

SOLUTION

I have created a SQL script that records the row count and the distinct row count of every table in every database on our server (excluding the 4 system databases):

DECLARE     @TableFullName AS NVARCHAR(MAX)
DECLARE     @SQLQuery AS NVARCHAR(MAX)
DECLARE     @TableHasDuplicates AS BIT
DECLARE     @TempTableRowCount AS INT
DECLARE     @ResultsTable TABLE ([CompleteTableName] NVARCHAR(200), [CountAll] INT, [CountDistinct] INT)
DECLARE     @CountAll INT
DECLARE     @CountDistinct INT

SET NOCOUNT ON
DECLARE @AllTables TABLE ([CompleteTableName] NVARCHAR(200))
INSERT INTO @AllTables ([CompleteTableName])
EXEC sp_msforeachdb 'SELECT ''['' + [TABLE_CATALOG] + ''].['' + [TABLE_SCHEMA] + ''].['' + [TABLE_NAME] + '']'' FROM [?].INFORMATION_SCHEMA.TABLES'
SET NOCOUNT OFF;

DECLARE     [table_cursor] CURSOR FOR
SELECT      *
FROM        @AllTables
WHERE       [CompleteTableName] NOT LIKE '%master%'
  AND       [CompleteTableName] NOT LIKE '%msdb%'
  AND       [CompleteTableName] NOT LIKE '%tempdb%'
  AND       [CompleteTableName] NOT LIKE '%model%';

OPEN [table_cursor]

FETCH NEXT FROM [table_cursor]
INTO @TableFullName

WHILE @@FETCH_STATUS = 0
    BEGIN
        SET         @SQLQuery = N'SELECT @CntAll = COUNT(*) FROM ' + @TableFullName
                               + N' SELECT @CntDistinct = COUNT(*) FROM (SELECT DISTINCT * FROM ' + @TableFullName + N') AS [sq]'
                               + N' IF @CntAll > @CntDistinct SELECT @BitResult = 1 ELSE SELECT @BitResult = 0';

        EXEC        sp_executesql @SQLQuery, N'@BitResult BIT OUTPUT, @CntAll INT OUTPUT, @CntDistinct INT OUTPUT', @BitResult = @TableHasDuplicates OUTPUT, @CntAll = @CountAll OUTPUT, @CntDistinct = @CountDistinct OUTPUT;

        IF @TableHasDuplicates = 1
            BEGIN
                INSERT INTO @ResultsTable ([CompleteTableName], [CountAll], [CountDistinct])
                SELECT @TableFullName, @CountAll, @CountDistinct
            END;

        FETCH NEXT FROM [table_cursor]
        INTO @TableFullName
    END 
CLOSE [table_cursor];
DEALLOCATE [table_cursor];

SELECT @CountAll = COUNT(*) FROM @ResultsTable;
PRINT N'There were ' + CAST(@CountAll AS NVARCHAR(10)) + N' tables with potential duplicate data';

SELECT *
FROM @ResultsTable

An overview of how it works: the table variable @AllTables is populated via sp_msforeachdb with INFORMATION_SCHEMA.TABLES to list all the tables in all databases (there are 16,537 tables). A cursor then iterates over the non-system entries, and for each table dynamic SQL performs a count and a distinct count; tables with potential duplicates are recorded in another table variable, @ResultsTable.
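For a single table, the generated dynamic SQL expands to something like the following (the table name [MyDb].[dbo].[MyTable] is made up purely for illustration):

-- Illustrative expansion of @SQLQuery for one hypothetical table
SELECT @CntAll = COUNT(*) FROM [MyDb].[dbo].[MyTable]
SELECT @CntDistinct = COUNT(*) FROM (SELECT DISTINCT * FROM [MyDb].[dbo].[MyTable]) AS [sq]
IF @CntAll > @CntDistinct SELECT @BitResult = 1 ELSE SELECT @BitResult = 0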

THE PROBLEM WITH THIS SOLUTION

When I run this query, it runs for circa 3 minutes and then throws an error saying that the tempdb PRIMARY filegroup is full.

I am my own DBA, and I used Brent Ozar's guide to set up my SQL Server instance; my tempdb is set up with 8 x 3 GB mdf/ndf files (the server has 8 cores).

These files show as having 23,997 MB available under the 'General' properties page.
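Those figures can be confirmed with a quick query against sys.database_files and FILEPROPERTY (a rough sketch, run in the tempdb context):

USE tempdb;
-- Size and free space per tempdb file, in MB
SELECT name,
       size * 8 / 1024                                     AS size_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) * 8 / 1024 AS free_mb
FROM sys.database_files;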

MY QUESTIONS

  1. If I have circa 24GB of tempdb free space, why is this relatively simple query running out of tempdb space?
  2. Is there a better/more efficient way of getting a count and distinct count of all tables in all databases?
Quarcheek
  • How much data is being processed? If you distinct data that has a lot of distinct values then all of those values end up in a temporary worktable. The workarounds for this are non-trivial. You would need to process the table in chunks. – usr Jul 07 '15 at 12:02
  • Thanks for replying usr. Are you referring to the subquery: SELECT DISTINCT * FROM ' + @TableFullName ? Is that placed into tempdb? Obviously it processes thousands of tables so if it's caching all these tables (I would have thought the cache would be cleared each iteration of the loop?) then it will be hundreds of GBs of data. – Quarcheek Jul 07 '15 at 12:41
  • Is the subquery not removed from cache each iteration of the loop? I'll do some Googling on this. – Quarcheek Jul 07 '15 at 12:43
  • What cache are you talking about? The tempdb usage ends when each query ends. There must be some big query. – usr Jul 07 '15 at 12:46
  • I'll admit my knowledge of query caching is extremely limited, but why would SQL server cache all the tables (and the whole table at that) for each of the sub-queries running inside the loop? – Quarcheek Jul 07 '15 at 13:01
  • There is no relevant caching here. It must somehow compute the distinct values. In order to do that it must sort or hash all the data. It needs temp storage for that. – usr Jul 07 '15 at 13:16
  • Thanks again for your help, sounds like I need to do some reading. – Quarcheek Jul 07 '15 at 13:30

1 Answer


You should always consider contention before adding TempDb files. Adding 7 additional TempDb files won't really help on its own.
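For example, TempDb allocation contention typically shows up as PAGELATCH waits on allocation pages in database id 2; a rough check:

-- Sessions currently waiting on tempdb pages (database id 2)
SELECT session_id,
       wait_type,
       wait_duration_ms,
       resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE N'PAGE%LATCH_%'
  AND resource_description LIKE N'2:%';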

If I have circa 24GB of tempdb free space, why is this relatively simple query running out of tempdb space?

No, it should not. But are you sure that you aren't dealing with a large amount of data, or that you don't have other processes running on the SQL Server? Cursors, temp tables and even table variables use TempDb extensively. Check which objects are consuming the most TempDb space:

SELECT
    SUM (user_object_reserved_page_count)*8 as usr_obj_kb,
    SUM (internal_object_reserved_page_count)*8 as internal_obj_kb,
    SUM (version_store_reserved_page_count)*8  as version_store_kb,
    SUM (unallocated_extent_page_count)*8 as freespace_kb,
    SUM (mixed_extent_page_count)*8 as mixedextent_kb
FROM sys.dm_db_file_space_usage

So, if your user and internal objects account for most of the space, it means that you are running low on TempDb space because of your cursors and SQL Server's internal usage (e.g. intermediate worktables, hash joins, hash aggregation, etc.)
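To see which session is allocating those internal objects, a per-session breakdown can help (a simple sketch using sys.dm_db_session_space_usage):

-- tempdb pages allocated per session (1 page = 8 KB)
SELECT session_id,
       user_objects_alloc_page_count     * 8 AS user_objects_kb,
       internal_objects_alloc_page_count * 8 AS internal_objects_kb
FROM sys.dm_db_session_space_usage
ORDER BY internal_objects_alloc_page_count DESC;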

Is there a better/more efficient way of getting a count and distinct count of all tables in all databases?

You can use the code below to get the row count of all tables in all databases:

DECLARE @Stats TABLE (DBNAME VARCHAR(40), NAME VARCHAR(200), Rows INT)

INSERT INTO @Stats
EXECUTE sp_MSForEachDB
    'USE ?;
     SELECT DB_NAME() AS DBName,
            sysobjects.Name,
            sysindexes.Rows
     FROM   sysobjects
            INNER JOIN sysindexes
                ON sysobjects.id = sysindexes.id
     WHERE  type = ''U''
            AND sysindexes.IndId < 2'

SELECT * FROM @Stats

I have written an article on TempDb recommendations; I would suggest reading it to understand which objects can affect TempDb and how to solve common problems with it. Ideally, your total TempDb size should be calculated based on observation, which in your case is more than 24 GB.
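Also, a single SELECT DISTINCT over a table larger than the free TempDb space will fill TempDb regardless of configuration, so it can help to check table sizes first. A rough sketch using sys.dm_db_partition_stats (the @TableSizes table variable name is illustrative):

-- Reserved size per user table across all databases, largest first
DECLARE @TableSizes TABLE (DatabaseName SYSNAME, TableName NVARCHAR(300), ReservedMB BIGINT)

INSERT INTO @TableSizes
EXEC sp_msforeachdb N'USE [?];
SELECT DB_NAME(),
       s.name + N''.'' + t.name,
       SUM(ps.reserved_page_count) * 8 / 1024
FROM sys.dm_db_partition_stats AS ps
     JOIN sys.tables  AS t ON t.object_id = ps.object_id
     JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name'

SELECT * FROM @TableSizes ORDER BY ReservedMB DESC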

**Edit 1**

If you are unsure whether statistics are up to date, then use the query below to get the count of all tables. Note: adjust the list of databases for which you don't want counts.

DECLARE @ServerStats TABLE (DatabaseName VARCHAR(200), TableName VARCHAR(200), RowsCount INT)

INSERT INTO @ServerStats
EXEC sp_msforeachdb @command1 = '
USE #;
IF ''#'' NOT IN (''master'', ''model'', ''msdb'', ''tempdb'', ''ReportServer'')
BEGIN
    PRINT ''#''
    EXEC sp_MSforeachtable @command1 = ''
        SELECT ''''#'''' AS DATABASENAME, ''''?'''' AS TABLENAME, COUNT(*) FROM ?;
    ''
END
', @replacechar = '#'

SELECT * FROM @ServerStats

Similarly, you can get the distinct row count for all tables in all databases with the query below:

DECLARE @ServerStatsDistinct TABLE (DatabaseName VARCHAR(200), TableName VARCHAR(200), RowsCount INT)

INSERT INTO @ServerStatsDistinct
EXEC sp_msforeachdb @command1 = '
USE #;
IF ''#'' NOT IN (''master'', ''model'', ''msdb'', ''tempdb'', ''ReportServer'')
BEGIN
    PRINT ''#''
    EXEC sp_MSforeachtable @command1 = ''
        SELECT ''''#'''' AS DATABASENAME, ''''?'''' AS TABLENAME, COUNT(*) FROM (
            SELECT DISTINCT *
            FROM ?
        ) a;
    ''
END
', @replacechar = '#'

SELECT * FROM @ServerStatsDistinct
Anuj Tripathi
  • Thanks for replying Anuj. While I don't claim to be an expert, I know enough about tempdb to know about contention and I have set 8 tempdb files for 8 processor cores following the Microsoft guidance at https://support.microsoft.com/en-us/kb/2154845. In this case, the problem is caused by the SELECT DISTINCT which is caching each time in tempdb. Since I am undertaking this query on every database in our server instance, no amount of tempdb reconfiguration will be enough to do this. However, my other area of experience is VB so I will write a VBA program to do this and post the answer here. – Quarcheek Jul 08 '15 at 09:45
  • I also wanted to point out that the code you provided for checking which objects are consuming tempdb space was very helpful, thank you (I've made use of this in the answer I will shortly provide). However, the code you've provided for getting the rows in each table is not very helpful as (1) it doesn't provide me the distinct count, which is a requirement of my question, and (2) it's based on statistics and not as accurate as COUNT(*). – Quarcheek Jul 08 '15 at 09:49
  • @Quarcheek I have edited my answer with updated query – Anuj Tripathi Jul 08 '15 at 12:19
  • Thanks for your continued support Anuj. In the end it turned out that using the query you gave me in your original answer helped me track down the source of the problem. I really like your distinct query actually, it's much more concise than my own. I didn't know that I could execute sp_msforeachdb into a table like that, or that you could nest sp_msforeachtable inside sp_msforeachdb. Very clever! Saves me using cursors! – Quarcheek Jul 08 '15 at 15:16
  • In the end the problem was that a large table (the only one larger than 24GB) was near the start of the results, and this was causing tempdb to throw the error. Once I sorted the results I discovered this problem. It's currently running through the last few databases once I excluded this table, so thanks again for your help. I've marked your answer as the accepted answer since the edit as you've answered both my questions successfully. :) – Quarcheek Jul 08 '15 at 15:18
  • @Quarcheek I am glad it helped! – Anuj Tripathi Jul 08 '15 at 15:30