I have to do some SQL Server 2008 R2 performance testing and it would be very convenient to do it using only SSMS and SQL Server, without additional application support.

One of the tests is querying a self-referencing table (a tree-like structure) with unknown content. So, for a start, I have to load something like 100K to 1M rows with random parent-child relationships into this table.

CREATE TABLE Test2 (
    ID int IDENTITY(1,1) PRIMARY KEY CLUSTERED NOT NULL,
    ParentID int NULL REFERENCES Test2 (ID))

I am currently trying with SSMS and this script to load 10K rows into the table:

SET NOCOUNT ON

-- root row
INSERT INTO Test2 (ParentID)
VALUES (NULL)

DECLARE @n int = 0

WHILE (1 = 1)
BEGIN
  --PRINT @n
  -- pick a random existing row as the parent of the new row
  INSERT INTO Test2 (ParentID)
  SELECT TOP 1 ID FROM Test2 ORDER BY NEWID()

  SET @n = @n + 1
  IF (@n >= 9999)   -- 9,999 children + 1 root = 10K rows
    BREAK
END

SET NOCOUNT OFF

My problem is that it runs for something like 2m 45s on my laptop. You can imagine how long loading 100K or even 1M records this way would take.

Is there a faster way to load this random tree-like structure into the table using T-SQL?

EDIT: After Mitch Wheat's suggestion, I replaced

SELECT TOP 1 ID FROM Test2 ORDER BY NEWID()

with

SELECT TOP 1 ID FROM Test2 
WHERE ID >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(ID) FROM Test2) 

Regarding random row selection, the results really do look uniformly distributed. Execution time falls from 160s to 5s (!), which lets me insert 100K records in ~60s. However, inserting 1M records with my RBAR script is still very slow, and I'm still searching for a possible set-based expression to fill my table, if one exists.

Now, after ~10 minutes of filling random data, I have 1M rows. That is slow but acceptable. However, copying this data to another table with a single batch insert takes under 10s:

SELECT * 
INTO Test3
FROM Test2

So, I believe some form of batch insert could speed up the process.
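A minimal sketch of what such a set-based pass might look like (my assumption, not tested code): each pass makes every existing row the parent of exactly one new row, so the table roughly doubles each time, and ~17 passes reach 100K rows from a single root.

-- hedged sketch: set-based doubling instead of inserting one row at a time
-- (assumes the single root row with ParentID = NULL was already inserted)
DECLARE @pass int = 0

WHILE @pass < 17
BEGIN
    INSERT INTO Test2 (ParentID)
    SELECT ID FROM Test2   -- every current row becomes a parent exactly once
    SET @pass += 1
END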

OzrenTkalcecKrznaric

3 Answers

You are not really measuring the INSERT performance with your posted code.

Picking a single random row using an ORDER BY clause like this:

SELECT TOP 1 * FROM table ORDER BY NEWID()

or even

SELECT TOP 1 * FROM table ORDER BY CHECKSUM(NEWID()) 

performs a table scan (because the random value associated with each row obviously needs to be calculated before the rows can be ordered), which can be slow for large tables. Using an indexed integer column (such as the one commonly used for a primary key) and a query of the form:

SELECT TOP 1 * FROM table 
WHERE rowid >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(rowid) FROM table) 

works in near-constant time (a single index seek), provided the rowid column is indexed. Note: this assumes that rowid is uniformly distributed in the range 0..MAX(rowid). If your dataset has some other distribution, your results will be skewed (i.e. some rows will be picked more often than others).
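As a quick sanity check of the uniformity claim, a hypothetical snippet (using the asker's Test2 table) can repeat the pick many times and count how often each ID comes up; roughly equal counts indicate a uniform pick, given a dense ID range:

-- hypothetical distribution check for the seek-based random pick
DECLARE @i int = 0
DECLARE @picks TABLE (ID int NOT NULL)

WHILE @i < 10000
BEGIN
    INSERT INTO @picks (ID)
    SELECT TOP 1 ID FROM Test2
    WHERE ID >= RAND(CHECKSUM(NEWID())) * (SELECT MAX(ID) FROM Test2)
    SET @i += 1
END

-- counts should be roughly equal across IDs
SELECT ID, COUNT(*) AS Picks
FROM @picks
GROUP BY ID
ORDER BY Picks DESC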

Mitch Wheat
  • This drastically improves execution time! I will include this in question. However, my query is still RBAR and I'm searching for something better if it exists. – OzrenTkalcecKrznaric Jul 01 '13 at 09:06
  • You can't describe an INSERT as RBAR: inserts are row by row by definition! Unless you perform a Batch Insert... – Mitch Wheat Jul 01 '13 at 09:09
  • Would batch insert be possible here? Or series of batch inserts, each batch iteration doubled.. hm. – OzrenTkalcecKrznaric Jul 01 '13 at 09:14
  • What are you trying to measure? The fastest rate at which you can load data? If so, use BCP or SqlBulkCopy from C#. Or are you trying to measure real-world insert performance where rows are inserted one at a time? – Mitch Wheat Jul 01 '13 at 09:16
  • I am trying to load random tree-like structure into database table using TSQL. I think 10 minutes is slow for loading 1M records. – OzrenTkalcecKrznaric Jul 01 '13 at 09:20
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/32668/discussion-between-ozren-tkalcec-krznaric-and-mitch-wheat) – OzrenTkalcecKrznaric Jul 01 '13 at 09:47

I ended up using my original approach with some tweaks:

  • disabling the self-referencing foreign key before the insert and re-enabling it afterwards
  • using batch inserts, as Mitch Wheat suggested

This is the schema:

DROP TABLE Test2
GO

CREATE TABLE Test2 (
    ID int IDENTITY(1,1) PRIMARY KEY CLUSTERED NOT NULL,
    ParentID int NULL /*REFERENCES Test2 (ID)*/
)
GO

ALTER TABLE Test2 
  ADD CONSTRAINT FK_SelfRef
    FOREIGN KEY(ParentID) REFERENCES Test2 (ID)
GO

And the script:

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SET NOCOUNT ON

ALTER TABLE Test2 NOCHECK CONSTRAINT FK_SelfRef

INSERT INTO Test2 (ParentID)
VALUES (NULL)

DECLARE @n int = 1

WHILE (1 = 1)
BEGIN
  -- batch insert: every existing row becomes the parent of exactly one
  -- new row, so each pass roughly doubles the table
  INSERT INTO Test2 (ParentID)
  SELECT ID FROM Test2 ORDER BY NEWID()

  SELECT @n = COUNT(*) FROM Test2

  IF (@n >= 999999)
    BREAK
END

ALTER TABLE dbo.Test2 WITH CHECK CHECK CONSTRAINT FK_SelfRef

SET NOCOUNT OFF

This executes in 10 seconds, and I can't match that speed with any other method.

NOTE: It inserts more records than needed, but the method can be arranged to insert the exact number of records by limiting the number of inserts in the last pass, as sketched below.
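A hedged sketch of such a final pass (the @target and @cnt variables are hypothetical; this assumes the table is still below the target and the remaining gap does not exceed the current row count):

DECLARE @target int = 1000000, @cnt int

SELECT @cnt = COUNT(*) FROM Test2

-- final pass: insert only as many children as needed to reach the target
INSERT INTO Test2 (ParentID)
SELECT TOP (@target - @cnt) ID
FROM Test2
ORDER BY NEWID()   -- one full sort is acceptable for a single final pass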

OzrenTkalcecKrznaric

When the parent is assigned randomly from the previously inserted rows, there is no control over the tree height (number of levels) or over how the levels are populated, which may be undesirable in some scenarios.

It may be more convenient to populate the tree with data level by level.

An auxiliary table-valued function is used to generate a number sequence, using Itzik Ben-Gan's cross-joined CTE method (see e.g. here about it):

create function ftItziksCJCTE
(
    @cnt int
)
returns table as
return
(
    WITH
        E00(N) AS (SELECT 1 UNION ALL SELECT 1),
        E02(N) AS (SELECT 1 FROM E00 a, E00 b),
        E04(N) AS (SELECT 1 FROM E02 a, E02 b),
        E08(N) AS (SELECT 1 FROM E04 a, E04 b),
        E16(N) AS (SELECT 1 FROM E08 a, E08 b),
        E32(N) AS (SELECT 1 FROM E16 a, E16 b),
        E(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
    select N from E where N <= @cnt
)
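A quick usage check (assuming the function was created in the default dbo schema):

select N from dbo.ftItziksCJCTE(10) -- returns the numbers 1 through 10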

Simple table to control elements distribution in the tree:

create table #TreeLevels
(
    LevelNo int identity(1, 1) not NULL,
    MinElements int not NULL,
    MaxElements int not NULL,
    primary key clustered (LevelNo)
)

Sample distribution:

insert into #TreeLevels values (7, 10)
insert into #TreeLevels values (70, 100)
insert into #TreeLevels values (700, 1000)

This will give us something like 7 to 10 elements with ParentID = NULL, each of them having roughly 70 to 100 children, and so on. The leaf level dominates the total: 7 × 70 × 700 = 343,000 to 10 × 100 × 1,000 = 1,000,000 elements overall.

Or other distribution:

insert into #TreeLevels values (1, 1)
insert into #TreeLevels values (9, 15)
insert into #TreeLevels values (10, 12)
insert into #TreeLevels values (9, 15)
insert into #TreeLevels values (10, 12)
insert into #TreeLevels values (9, 15)
insert into #TreeLevels values (10, 12)

Meaning there will be a single root element with somewhere between 9 and 15 child elements, each of those having roughly 10 to 12 children, etc.

Then the tree can be populated level by level:

declare @levelNo int, @eMin int, @eMax int

-- TestTree has the same schema as Test2 above;
-- the two temp tables hold the IDs generated for the previous level
create table #Inserted (ID int not NULL, primary key nonclustered (ID))
create table #Inserted2 (ID int not NULL, primary key nonclustered (ID))

set @levelNo = 1
while 1=1
begin
    select @eMin = MinElements, @eMax = MaxElements from #TreeLevels where LevelNo = @levelNo

    if @@ROWCOUNT = 0
        break

    if @levelNo = 1
    begin
        -- first level: generate the root elements (ParentID = NULL)
        insert into TestTree (ParentID)
        output inserted.ID into #Inserted (ID)
        select NULL from ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0))
    end
    else
    begin
        -- the two temp tables alternate roles: one holds the previous level's
        -- IDs, the other collects the IDs of the level being inserted
        if exists (select 1 from #Inserted)
        begin
            insert into TestTree (ParentID)
            output inserted.ID into #Inserted2 (ID)
            select
                I.ID
            from
                #Inserted I
                cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F

            truncate table #Inserted
        end
        else
        begin
            insert into TestTree (ParentID)
            output inserted.ID into #Inserted (ID)
            select
                I.ID
            from
                #Inserted2 I
                cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F

            truncate table #Inserted2
        end
    end

    set @levelNo = @levelNo + 1
end

However, there is no control over the exact number of elements the tree will contain, and leaf nodes appear on the last level only. It would be good to have an additional parameter controlling level population (the percentage of nodes on a level that will have children).

create table #TreeLevels
(
    LevelNo int identity(1, 1) not NULL,
    MinElements int not NULL,
    MaxElements int not NULL,
    PopulatedPct float NULL,
    primary key clustered (LevelNo)
)

Sample distribution:

insert into #TreeLevels values (1, 1, NULL)
insert into #TreeLevels values (9, 15, NULL)
insert into #TreeLevels values (10, 12, NULL)
insert into #TreeLevels values (9, 15, 80)
insert into #TreeLevels values (10, 12, 65)
insert into #TreeLevels values (9, 15, 35)
insert into #TreeLevels values (10, 12, NULL)

NULL in the PopulatedPct column is treated as 100%. PopulatedPct controls how the next level is populated, so during the loop it is read from the previous level's row; consequently it has no meaning for the last row in #TreeLevels.

Now we can cycle through the levels taking PopulatedPct into account:

declare @levelNo int, @eMin int, @eMax int

create table #Inserted (ID int not NULL, primary key nonclustered (ID))
create table #Inserted2 (ID int not NULL, primary key nonclustered (ID))

set @levelNo = 1
while 1=1
begin
    select @eMin = MinElements, @eMax = MaxElements from #TreeLevels where LevelNo = @levelNo

    if @@ROWCOUNT = 0
        break

    if @levelNo = 1
    begin
        insert into TestTree (ParentID)
        output inserted.ID into #Inserted (ID)
        select NULL from ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0))
    end
    else
    begin
        declare @pct float
        -- PopulatedPct of the previous level decides what share of its nodes get children
        select @pct = PopulatedPct from #TreeLevels where LevelNo = @levelNo - 1

        if exists (select 1 from #Inserted)
        begin
            if (@pct is NULL)
                insert into TestTree (ParentID)
                output inserted.ID into #Inserted2 (ID)
                select
                    I.ID
                from
                    #Inserted I
                    cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F
            else
                insert into TestTree (ParentID)
                output inserted.ID into #Inserted2 (ID)
                select
                    I.ID
                from
                    (select top (@pct) PERCENT ID from #Inserted order by rand(checksum(newid()))) I
                    cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F

            truncate table #Inserted
        end
        else
        begin
            if (@pct is NULL)
                insert into TestTree (ParentID)
                output inserted.ID into #Inserted (ID)
                select
                    I.ID
                from
                    #Inserted2 I
                    cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F
            else
                insert into TestTree (ParentID)
                output inserted.ID into #Inserted (ID)
                select
                    I.ID
                from
                    (select top (@pct) PERCENT ID from #Inserted2 order by rand(checksum(newid()))) I
                    cross apply ftItziksCJCTE(round(rand(checksum(newid())) * (@eMax - @eMin) + @eMin, 0)) F

            truncate table #Inserted2
        end
    end

    set @levelNo = @levelNo + 1
end

Still there is no control over the exact number of elements, but better control over the tree shape is gained.
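If an exact total is required, one hypothetical post-processing step (not part of the original approach) is to delete surplus leaf nodes after the fill; only leaves can be removed without orphaning children:

declare @target int = 500000   -- hypothetical desired total
declare @surplus int

select @surplus = count(*) - @target from TestTree

-- assumes the fill overshot the target; repeat if @surplus exceeds the leaf count
;with extra as
(
    select top (@surplus) ID
    from TestTree t
    where not exists (select 1 from TestTree c where c.ParentID = t.ID) -- leaves only
    order by newid()
)
delete from extra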

i-one
  • This is very interesting answer, considering I didn't think of a tree shape at all. I have to test it yet, but I get the point. +1 – OzrenTkalcecKrznaric Jul 02 '13 at 22:03
  • I tested your scripts. The first variant filled approximately 550K rows over three test passes, with an average execution time of 10s. The second variant filled 530K rows in 11s. The distribution seems OK: tree depth is uniformly distributed across branches, while my method doesn't produce that kind of distribution. However, this method is 40-50% slower than my simple batch insert. Is there a way to rearrange it to be somewhat faster? – OzrenTkalcecKrznaric Jul 04 '13 at 12:00
  • When you ran the tests, did you disable FK_SelfRef as in your method? If not, maybe that will help. If yes, then I don't see a way to improve it (to the best of my knowledge currently). – i-one Jul 04 '13 at 22:19