0

I have quite a few million records into a table which supports WCF application. The application basically check for the existence of the recored and if not found it inserts that specific record.

one of the key field on which it checks for the existence is VARBINARY(256).

I am currently using HASHBYTES() to bring down the byte counts to 32 bytes by using SHA2_256 algorithm as below.

CREATE TABLE BlobTable ( BlobID INT, Blob VARBINARY(256), BlobHash VARBINARY(32)) 
DECLARE @Bin VARBINARY(256) = CRYPT_GEN_RANDOM(256) 
DECLARE @BinHash VARBINARY(32) = HASHBYTES('SHA2_256', @Bin)
DECLARE @Bin VARBINARY(256)

SELECT @_Bin = Blob FROM dbo.BlobTable WITH (ROWLOCK, READPAST) WHERE BlobHash = @BinHash
IF (@_Bin IS NULL)
BEGIN
    INSERT INTO dbo.BlobTable (Blob, BlobHash) VALUES (@Bin, @BinHash)
END

Is there a way to bring down query cost in the select statement above? Is there a way to get the unique value for VARBINARY(256) field which is shorter like VARBINARY(16) or less, but still avoiding the duplicates?

Thanks

user1144852
  • 265
  • 3
  • 6
  • 17

2 Answers2

0
CREATE UNIQUE INDEX blobhash_index ON BlobTable ( BlobHash);

SELECT LEN(HASHBYTES('MD5', Blob)) FROM BlobTable
nshah
  • 340
  • 1
  • 5
  • MD5 will get 20 bytes but, my feeling is that it would increase the chance of collision, since we are talking about millions of records. I have it indexed.But, just putting unique constraint, I would be rejecting that row. I do want to insert if all 256 bytes are different. Irrespective of hashes since I have it just to boost performance, but its collision is not good for the supported application. – user1144852 Aug 30 '14 at 04:28
  • This is interesting: http://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions – nshah Aug 30 '14 at 04:34
  • Also, is the table partitioned? http://msdn.microsoft.com/en-us/library/ms175533(v=sql.105).aspx – nshah Aug 30 '14 at 04:44
  • Looks very interesting...Thanks for sharing. I should probably use MD5 then. – user1144852 Aug 30 '14 at 04:44
  • In your case, security is not a concern but I should point out that MD5 is considered broken. http://stackoverflow.com/questions/770900/is-md5-less-secure-than-sha-et-al-in-a-practical-sense – nshah Aug 30 '14 at 04:53
0

I don't think the ROWLOCK/READPAST is necessary here. The key to performance is indexing. I ran a performance using the T-SQL below and observed a rate of about 5K/sec with a 10 thread test harness. The compound unique constraint key will allow non-existence of the row to be determined quickly and still allow different blobs to be inserted in the unlikely event of a hash collision. Be aware that the lack of serializable can result in a duplicate key violation if different threads try to insert the same blob at the same time so your code will need to handle that.

CREATE TABLE dbo.BlobTable(
      BlobID INT IDENTITY
        CONSTRAINT PK_BlobTable PRIMARY KEY CLUSTERED
    , Blob VARBINARY(256)
    , BlobHash VARBINARY(32)
    );
GO

--load 3M rows
WITH 
     t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
    ,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
    ,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c)
INSERT INTO dbo.BlobTable WITH(TABLOCKX) (Blob, BlobHash)
SELECT Blob, HASHBYTES('SHA2_256', Blob)
FROM (
    SELECT CRYPT_GEN_RANDOM(256) AS Blob
    FROM t16M
    WHERE num <= 3000000) AS Blobs;
UPDATE STATISTICS dbo.BlobTable WITH FULLSCAN;
ALTER TABLE dbo.BlobTable
    ADD CONSTRAINT UQ_BlobTable1_Blob_BlobHash UNIQUE NONCLUSTERED(BlobHash, Blob);
CHECKPOINT;
GO

CREATE PROC dbo.usp_insert_BlobTable
AS
SET NOCOUNT ON;

DECLARE @Bin VARBINARY(256) = CRYPT_GEN_RANDOM(256);
DECLARE @BinHash VARBINARY(32) = HASHBYTES('SHA2_256', @Bin);
DECLARE @_Bin VARBINARY(256);

INSERT INTO dbo.BlobTable (Blob, BlobHash)
SELECT @Bin, @BinHash
WHERE NOT EXISTS(
    SELECT *
    FROM dbo.BlobTable
    WHERE
        BlobHash = @BinHash
        AND Blob = @Bin
    );
Dan Guzman
  • 43,250
  • 3
  • 46
  • 71
  • I need to support about 1100 calls to the WCF service in a fraction of a second. That means this insert statement will be called that many times and I really need the transaction to be very very short. To my knowledge it will take less time to check on 32 bytes then to check on 256 bytes. I am using (ROWLOCK, READPAST) hint in my select query to check for the existence of the record. – user1144852 Aug 30 '14 at 22:41
  • @user1144852, I updated my answer according to your comments. – Dan Guzman Aug 31 '14 at 16:12