
I am fairly new to SQL Server and I am running the following command from inside my C# application:

DECLARE @SQLString nvarchar(4000);
SET @SQLString = N'BULK INSERT events FROM '+ QUOTENAME(@p0) +' WITH ( BATCHSIZE = 50000, CODEPAGE = ''65001'', FIELDTERMINATOR = ''|'', ROWTERMINATOR =''\n'' )'
EXECUTE sp_executesql @SQLString

With @p0 being the path to the .csv file.

Right now I am generating a SHA1 hash as a BigInteger from a combination of 3 columns in my C# code and writing it into the CSV file as a new column (which is the primary key).

Now I saw that it is possible to generate the SHA1 hash inside SQL Server. Is this possible while bulk inserting?

E.g. bulk insert the CSV file; for each row take columns X, Y, Z, generate a SHA1 hash, convert it to BigInteger and insert it into column P?
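For reference, a minimal sketch of the HASHBYTES-to-bigint conversion in T-SQL, using the placeholder names X, Y, Z, P from the example above (these columns are not part of the actual table):

-- Sketch: compute a SHA1 hash over three columns and convert it to bigint.
-- HASHBYTES('SHA1', ...) returns varbinary(20); CAST to bigint keeps the low-order 8 bytes.
SELECT CAST(HASHBYTES('SHA1', CONCAT(X, Y, Z)) AS bigint) AS P
FROM events;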

EDIT: I am trying the suggestion from @Nick.McDermaid, but I can't seem to get it working:

CREATE TABLE [dbo].[test] (
[User] [nvarchar](185) NOT NULL,
[Stat] [nvarchar](25) NOT NULL,
[Name] [nvarchar](max) NOT NULL,
[HashByte] AS (convert(bigint, HASHBYTES('SHA1',CONVERT(nvarchar(max),[User]+[Stat]+[Name])))),

CONSTRAINT [PK_dbo.test] PRIMARY KEY ([HashByte]))

I get an error for incorrect syntax.

Thypari
  • The easiest way is to create a calculated column in the table that you are bulk inserting into and define the sha1 expression there. – Nick.Mc Sep 26 '17 at 10:59
  • I would think it's probably faster to do the bulk insert and then update the rows afterward. update mytable set hashcol = convert(bigint, HASHBYTES('SHA1',CONVERT(nvarchar(4000),x+y+z))) – Mike Sep 26 '17 at 11:10
  • @Mike How can I insert them without the primary key, which is the hash? @Nick.McDermaid Sounds reasonable. Can you point me in the right direction where I can read about calculated columns? I came up with this: Alter Table [tableName] Add HashByte As (convert(bigint, HASHBYTES('SHA1',CONVERT(nvarchar(4000),[column1]+[column2]+[column3])))) ? – Thypari Sep 26 '17 at 11:15
  • @Nick.McDermaid I edited my question. I tried creating the calculated column when creating the table. But somehow I can't find the syntax error... – Thypari Sep 26 '17 at 11:35
  • Your syntax looks fine. I tried it and also got an error. I reduced it and eventually got "Column 'HashByte' in table 'test' is of a type that is invalid for use as a key column in an index.". So you're not going to be able to create a primary key on it. Maybe a unique constraint. What's the purpose of having a primary key on it anyway? – Nick.Mc Sep 26 '17 at 11:44
  • No you can't create any kind of unique constraint on that column. Depending on your requirements, there's nothing wrong with doing the update as @Mike mentioned, then adding the PK afterwards (see the sketch after these comments). – Nick.Mc Sep 26 '17 at 11:47
  • Well... I want to make sure there are no duplicate entries of a specific column combination. So for me the easiest was to use IGNORE_DUP_KEY, generate a sha1 hash out of these 3 columns and use it as PK. This way I don't have to check every single entry if this column combination is already there. – Thypari Sep 26 '17 at 11:47
  • What if there is a duplicate there? What do you want to do in that case? Why not just create a composite key on all three columns? – Nick.Mc Sep 26 '17 at 12:13
  • I don't know that I would create a primary key on that column, maybe use an identity column for the PK and put a secondary index on the hash. Have you tried creating the table with the hashbytes function as a default on the hash column? Something like here: https://stackoverflow.com/questions/442503/bind-a-column-default-value-to-a-function-in-sql-2005 – Mike Sep 26 '17 at 13:44
  • A composite key has a 900byte index limit as far as I know. One of these columns alone is above that limit. @Mike So create a normal ID column and secondary index on hash? I will try it with a function tomorrow! – Thypari Sep 26 '17 at 14:08
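A hedged sketch of the update-then-key approach discussed in the comments above, assuming the three columns are [User], [Stat] and [Name] as in the edited question; the constraint name is illustrative:

-- Sketch: bulk insert first, compute the hash afterwards, then add the key.
-- Assumes dbo.events already holds the bulk-inserted rows and has no HashByte column yet.
ALTER TABLE dbo.events ADD HashByte bigint NULL;

UPDATE dbo.events
SET    HashByte = CONVERT(bigint,
           HASHBYTES('SHA1', CONVERT(nvarchar(4000), [User] + [Stat] + [Name])));

-- Only possible once every row has a non-NULL, unique hash.
ALTER TABLE dbo.events ALTER COLUMN HashByte bigint NOT NULL;
ALTER TABLE dbo.events ADD CONSTRAINT PK_events_HashByte PRIMARY KEY (HashByte);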

2 Answers

1

I suggest you take a step back here. Firstly: are you saying that if there is a one-character difference in your nvarchar(max) field (2 GB), then the record is a unique record? What is the purpose of defining unique records here, and what happens when a "duplicate" appears?

In this situation I recommend you follow this very commonly used staging pattern, which ends up being used for most data import processes:

  1. BULK INSERT into a staging table that has no PK
  2. Use INSERT to only insert unique records into your real table
  3. Your real table has a simple int identity PK and is guaranteed to be unique on the required columns due to step 2
  4. You can identify records with issues in your staging table

From my experiments, it's not possible to create any kind of unique constraint/unique index/PK on this particular calculated field.

Some sample code for steps 1 and 2 would be:

-- Bulk insert into staging table
BULK INSERT staging.events FROM.....

-- Only insert records from staging that aren't already there
INSERT INTO dbo.events ([User],[Stat],[Name])
SELECT [User],[Stat],[Name]
FROM staging.events S
WHERE NOT EXISTS (
   SELECT * FROM dbo.events E
   WHERE E.[User] = S.[User]
   AND E.[Stat] = S.[Stat]
   AND E.[Name] = S.[Name]
)

Now if you like you can write another update back to the staging table that identifies duplicates.
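For example, a hedged sketch of such an update; the IsDuplicate flag column is an assumption and is added to the staging table here purely for illustration:

-- Sketch: flag staging rows whose column combination already exists in the real table.
ALTER TABLE staging.events ADD IsDuplicate bit NULL;

UPDATE S
SET    S.IsDuplicate = 1
FROM   staging.events S
WHERE  EXISTS (
    SELECT * FROM dbo.events E
    WHERE E.[User] = S.[User]
      AND E.[Stat] = S.[Stat]
      AND E.[Name] = S.[Name]
);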

It really comes back to the meaning of 'duplicate'. If you have one character difference in your Name column, is that a duplicate?

Nick.Mc
  • Thanks for your answer! If one character is different in one of the three columns it is unique. Using nvarchar(max) was stupid of me. The length can vary but will most likely not exceed 2000 characters. – Thypari Oct 02 '17 at 11:58
  • I wonder what the performance would be if I already had 5m+ entries in dbo.events and had to check whether another 200,000 rows from a CSV are already in dbo.events? – Thypari Oct 04 '17 at 12:56
  • Depends a lot on whether you have indexes and how much RAM you have – Nick.Mc Oct 04 '17 at 13:01
  • I would have to index all three columns. But then a clustered index has a limit of 900 bytes, which wouldn't be enough. And 16 GB of RAM. – Thypari Oct 04 '17 at 13:31
  • Oh sorry now I'm with you. I think you should pursue the hash solution and index the hash column. Does the hash function work with varchar(2000) instead of varchar(max)? – Nick.Mc Oct 04 '17 at 13:43
  • I think I kind of lost track... You suggest I use your answer but create a hash in the staging and regular table which I index for faster lookup? – Thypari Oct 04 '17 at 13:48
  • Sorry I'm all over the place. My answer will probably work but will be slow for that many records. If you can add this hash to both tables and index it then join on that then that will be a lot faster. – Nick.Mc Oct 04 '17 at 13:51
  • So yes- that's what I suggest but first you need to see if you can get the hash working. You might have more success updating afterwards rather than using a calculated column. – Nick.Mc Oct 04 '17 at 13:52
  • I got it working with CREATE TABLE dbo.test ( [User] nvarchar(185) NOT NULL, [Stat] nvarchar(25) NOT NULL, [Name] nvarchar(max) NOT NULL, [HashByte] AS CAST(HASHBYTES('SHA1', CONCAT([User],[Stat],[Name])) AS BIGINT) PERSISTED, CONSTRAINT [PK_dbo.test] PRIMARY KEY ([HashByte]) ) And then use no_dup_keys! – Thypari Oct 04 '17 at 15:17
  • Awesome. Could you perhaps post an answer with this info in it - it will help others with the same challenge – Nick.Mc Oct 04 '17 at 22:57
0

I got it working with:

CREATE TABLE dbo.test (
    [User] nvarchar(185) NOT NULL,
    [Stat] nvarchar(25) NOT NULL,
    [Name] nvarchar(max) NOT NULL,
    [HashByte] AS CAST(HASHBYTES('SHA1', CONCAT([User],[Stat],[Name])) AS BIGINT) PERSISTED,
    CONSTRAINT [PK_dbo.test] PRIMARY KEY ([HashByte])
)

And then use IGNORE_DUP_KEY!
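For completeness, a hedged sketch of how that option can be declared on the primary key so duplicate hashes are silently discarded during the bulk insert instead of raising an error; it is the same table as above, only the constraint gains a WITH clause:

-- Duplicate hash values are ignored with a warning instead of failing the insert.
CREATE TABLE dbo.test (
    [User] nvarchar(185) NOT NULL,
    [Stat] nvarchar(25) NOT NULL,
    [Name] nvarchar(max) NOT NULL,
    [HashByte] AS CAST(HASHBYTES('SHA1', CONCAT([User],[Stat],[Name])) AS BIGINT) PERSISTED,
    CONSTRAINT [PK_dbo.test] PRIMARY KEY ([HashByte]) WITH (IGNORE_DUP_KEY = ON)
)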

Thypari