
I am in the process of re-writing a stored procedure that obtains the row count and max Id of all tables in a database. The database has close to 500 tables and over a billion entries, so the old procedure is way too slow to continue to use.

Here is the old procedure:

DECLARE @TableRowCounts TABLE ([TableName] VARCHAR(128), [RowCount] BIGINT, [MaxId] BIGINT) ;
INSERT INTO @TableRowCounts ([TableName], [RowCount], [MaxId])
EXEC sp_MSforeachtable 'SELECT ''?'' [TableName], COUNT(*) [RowCount], MAX(Id) [MaxId] FROM ?' ;
SELECT [TableName], [RowCount], [MaxId]
FROM @TableRowCounts
ORDER BY [TableName]

This would give a result something like the following:

TableName | RowCount | MaxId
-------------------------------
TableA    | 12345678 | 12345688
TableB    | 90123456 | 90123466
TableC    | 78901234 | 78901244

I can't quite say how long it takes to run because I have yet to actually observe it complete at the current database's size.

Here is a work-in-progress new query:

SELECT 
  o.NAME, 
  i.rowcnt
FROM sysindexes AS i
INNER JOIN sysobjects AS o ON i.id = o.id
--INNER JOIN sys.tables AS t ON t.[object_id] = o.id ???
--INNER JOIN sys.schemas AS s on t.[schema_id] = s.[schema_id] ???
--INNER JOIN sys.columns AS c on t.[object_id] = c.[object_id] ???
WHERE i.indid < 2  AND OBJECTPROPERTY(o.id, 'IsMSShipped') = 0
ORDER BY o.NAME

My idea was to use sys.schemas and sys.columns so that I can use MAX(Id) inside of my SELECT, but I am currently stuck on how to fully incorporate this functionality. If there is another, better way to do this, I am open to suggestions.
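
Roughly, I picture the commented-out joins coming in like this (just a sketch; the sys.columns join and MAX(Id) are the parts I have not worked out):

SELECT
    s.name + '.' + o.name AS [TableName],
    i.rowcnt              AS [RowCount]
FROM sysindexes AS i
INNER JOIN sysobjects AS o ON i.id = o.id
INNER JOIN sys.tables  AS t ON t.[object_id] = o.id
INNER JOIN sys.schemas AS s ON s.[schema_id] = t.[schema_id]
WHERE i.indid < 2
  AND OBJECTPROPERTY(o.id, 'IsMSShipped') = 0
ORDER BY [TableName];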

I do need both the row count and MAX(Id). My dataset should not contain any missing Ids, and comparing the two values shows at a glance when one is missing. The data is being cached from an external source, so if the row count is not equal to MAX(Id), the client consuming the database can see this and take the necessary actions to fill in the missing rows. The client also uses the row count and MAX(Id) for other tasks, such as comparing the external source's current Id to the database's MAX(Id). If the external source's current Id is greater than the database's MAX(Id), there is work to be done.
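
To illustrate, the checks the client performs look roughly like this (the table variable and variable names are just placeholders, and in practice the external "Current Id" is tracked per table):

-- Placeholder names; a minimal sketch of the two checks described above.
DECLARE @Results TABLE ([TableName] VARCHAR(128), [RowCount] BIGINT, [MaxId] BIGINT);
INSERT INTO @Results VALUES ('TableA', 12345678, 12345688), ('TableB', 90123456, 90123456);

DECLARE @ExternalCurrentId BIGINT = 90123470; -- "Current Id" reported by the external source

SELECT [TableName], [RowCount], [MaxId]
FROM @Results
WHERE [RowCount] <> [MaxId]          -- a gap: at least one Id is missing
   OR [MaxId] < @ExternalCurrentId;  -- behind: the external source has newer records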

Haus
  • Just a note, but MAX(Id) won't give you the proper row count if your tables have ever had a DELETE run on them or if Id is an IDENTITY column and the identity has been reseeded at any time. Might not be the case for you but wanted to point that out. – squillman Mar 11 '19 at 11:39
  • What if a column named Id is not present in one of the tables? – Suraj Kumar Mar 11 '19 at 11:39
  • Try this to get the row count https://stackoverflow.com/a/2221898/10532500 – Suraj Kumar Mar 11 '19 at 11:40
  • @squillman I understand this. Sorry if my wording is unclear. I want the row count as well as the max id of the table. – Haus Mar 11 '19 at 12:00
  • @SurajKumar All tables have the same schema, so `Id` is guaranteed to be present. Thank you for the link, but my new procedure already successfully gets the row count for each table, just not `MAX(Id)` – Haus Mar 11 '19 at 12:01
  • I think it would help if you could tell us *why* you need the max `id` value for each table all in one go? – iamdave Mar 11 '19 at 12:10
  • @iamdave I have appended the original post to explain why I need both. Essentially it is required by the client program to determine what work needs to be done. – Haus Mar 11 '19 at 12:13
  • *"what work needs to be done"* – Across a billion rows of data, what does this actually mean? This seems like a badly designed solution to a problem you have yet to explain. I would imagine there is a much more elegant solution to your problem than this one, though without knowing the actual purpose, we can't help you with it. – iamdave Mar 11 '19 at 12:16
  • @iamdave The external data source I am caching locally provides a "Current Id", which designates the last record added. I need to be able to check my local database's `MAX(Id)` and compare it to the external data source's "Current Id", and if it is less than the source, the client will see this and queue tasks that will download the data and store them in the local database. The row count is then for determining if any records may be missing. – Haus Mar 11 '19 at 12:18
  • @Haus Request you to check my answer. – Suraj Kumar Mar 11 '19 at 12:40
  • Have you tried creating a trigger in each table that updates a separate table covering all the tables? So you would have a table called `DatabaseTables`; within this table you would have all the tables listed as rows, with the columns as your expected results. Then you can simply run a select query like `SELECT * FROM DATABASETABLES`. Would this work? – Attie Wagner Mar 11 '19 at 13:03

2 Answers


As explained in my comment, you can use triggers to accomplish this.

This will be significantly faster than the old stored procedure.

The following script creates a DatabaseTables table listing all the tables in the database, and prints the triggers for you to run on each of them:

declare @loop   int
,       @query  varchar(max)

-- Create the tracking table if it does not already exist
if not exists(select name from sysobjects where name = 'DatabaseTables')    --drop table DatabaseTables
create table DatabaseTables (id int identity primary key, TableName varchar(50), IdentityColumn varchar(50), [RowCount] int, MaxId int)

-- Register every user table whose first column is an int (assumed to be the Id column)
insert into DatabaseTables (TableName, IdentityColumn)
select
    name
,   column_name
from        sysobjects  o
inner join  information_schema.columns  c on o.name = c.table_name
where   xtype = 'U'
and     c.ordinal_position = 1
and     name    <>  'DatabaseTables'
and     data_type = 'int'
and     name not in (select TableName from DatabaseTables)
order by name

begin

-- Loop over the registered tables and build a CREATE TRIGGER script for each one
select @loop = min(id) from DatabaseTables
while @loop is not null

begin
begin
    set @query = 

'set ansi_nulls on
go
set quoted_identifier on
go

create trigger '+(select TableName from DatabaseTables where id = @loop)+'_trg on '+(select TableName from DatabaseTables where id = @loop)+'
after insert
as

if (select trigger_nestlevel(object_id('''+(select TableName from DatabaseTables where id = @loop)+'_trg''))) > 1
     return

begin
    update DatabaseTables set
        [RowCount]  = (select count(*) from '+(select TableName from DatabaseTables where id = @loop)+')
    ,   MaxId       = (select max('+(select IdentityColumn from DatabaseTables where id = @loop)+') from '+(select TableName from DatabaseTables where id = @loop)+')
    where   TableName = '''+(select TableName from DatabaseTables where id = @loop)+'''
end;

'

    -- The trigger scripts are printed rather than executed; run the printed output to create the triggers
    print (@query)
end
    select @loop = min(id) from DatabaseTables where id>@loop
end
end;
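
For a hypothetical table named TableA whose first (int) column is Id, the printed script would look like this; you would then run it to create the trigger:

set ansi_nulls on
go
set quoted_identifier on
go

create trigger TableA_trg on TableA
after insert
as

if (select trigger_nestlevel(object_id('TableA_trg'))) > 1
     return

begin
    update DatabaseTables set
        [RowCount]  = (select count(*) from TableA)
    ,   MaxId       = (select max(Id) from TableA)
    where   TableName = 'TableA'
end;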

From here, instead of running a stored procedure to get your results, you will just run:

select
*
from    DatabaseTables

Let me know if this could work?

Attie Wagner

You can try the query below. The explanation is given in the comments.

CREATE TABLE #x(t NVARCHAR(520), c BIGINT); --Used to store max id
CREATE TABLE #counts --Used to store rowcount
(
    table_name varchar(255),
    row_count int
)

--Query to get max id in a table - #x
DECLARE @sql NVARCHAR(MAX);

SET @sql = N'';
SELECT @sql = @sql + N'INSERT #x SELECT ''' 
  + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ''',
  MAX(' + c.name + ') FROM '
  + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ';'

FROM sys.columns AS c
    INNER JOIN sys.tables AS t ON c.object_id = t.object_id
    INNER JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE c.name = 'Id'; --To specify the column name for the max id value

EXEC sp_executesql @sql;
--SELECT t, c FROM #x; --To get the max id

--Query to get row count in a table - #counts
EXEC sp_MSForEachTable @command1='INSERT #counts (table_name, row_count) SELECT ''?'', COUNT(*) FROM ?'
--SELECT table_name, row_count FROM #counts ORDER BY table_name, row_count DESC -- To get the row count

Select table_name as [Table Name], c as [Max Id], row_count as [Total Rows]
from #counts
inner join #x on t = table_name

DROP TABLE #counts
DROP TABLE #x;

The output will be in the following format.

Table Name          Max Id    Total Rows
------------------  --------  ----------
[dbo].[Employee]    8         8
[dbo].[test]        3         3
[dbo].[Family]      2         6
Suraj Kumar
  • The query takes a very long time to run. I think the problem with this (and my original query) is that it is using `COUNT(*)` on each table, which is a very expensive operation. If there is a way to use the `sys.indexes` catalog for row counts instead, I'm sure it would be much faster. – Haus Mar 11 '19 at 12:45
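
A sketch of that idea: keep the dynamic MAX(Id) collection from this answer, but take the row counts from sys.partitions instead of COUNT(*). It assumes, as in the question, that every user table has an Id column (if Id is the clustered primary key, MAX(Id) is a cheap index seek); the #maxids temp table name is just a placeholder, and catalog row counts can lag slightly behind an exact COUNT(*).

CREATE TABLE #maxids (table_name NVARCHAR(520), max_id BIGINT);

-- Build one INSERT ... MAX(Id) statement per table, as in the answer above.
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql = @sql + N'INSERT #maxids SELECT '''
    + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ''', MAX(' + QUOTENAME(c.name) + ') FROM '
    + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ';'
FROM sys.columns AS c
INNER JOIN sys.tables AS t ON t.[object_id] = c.[object_id]
INNER JOIN sys.schemas AS s ON s.[schema_id] = t.[schema_id]
WHERE c.name = 'Id';

EXEC sp_executesql @sql;

-- Row counts straight from the catalog (no table scans).
SELECT
    QUOTENAME(s.name) + '.' + QUOTENAME(t.name) AS [Table Name],
    m.max_id                                    AS [Max Id],
    SUM(p.rows)                                 AS [Total Rows]
FROM sys.tables AS t
INNER JOIN sys.schemas AS s ON s.[schema_id] = t.[schema_id]
INNER JOIN sys.partitions AS p ON p.[object_id] = t.[object_id] AND p.index_id IN (0, 1) -- heap or clustered index
INNER JOIN #maxids AS m ON m.table_name = QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
GROUP BY s.name, t.name, m.max_id
ORDER BY [Table Name];

DROP TABLE #maxids;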