
I need to read approximately 100 million records from a SQL Server 2016 database table, generate hashes for one or more columns, and write these records back to the table.

So far I have tried a couple of solutions that are too slow for our needs. I am testing this on a Dell XPS 15 with an i7-7700HQ processor and 32 GB of RAM. First I tried the T-SQL HASHBYTES() function with an SHA1 hash, but this took more than a couple of hours on a test dataset of 100 million records.
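
For reference, that attempt boils down to something like the following sketch, written as a single server-side UPDATE issued from C# (HashedValue and ValueToHash are placeholder column names; as mentioned in the comments I also tried a computed-column variant):

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    @"UPDATE [ContosoRetailDW].[dbo].[FactInventory]
      SET HashedValue = CONVERT(char(40), HASHBYTES('SHA1', CAST(ValueToHash AS varchar(50))), 2);",
    connection))
{
    connection.Open();
    // No timeout: on ~100 million rows this statement ran for more than a couple of hours.
    command.CommandTimeout = 0;
    command.ExecuteNonQuery();
}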

Updating the rows through a C# OleDbDataReader was of course even slower, and the bottleneck seemed to be writing the records. I am now at the point where I use SqlBulkCopy to copy the altered records into a new temp table, which is much faster than updating an existing table. But it still takes 40 minutes to generate all the hashes, and writing the records back takes a multiple of that. Ideally we want the entire operation to finish within the hour. Does anyone have an idea where I could optimize further? Here is the code:

using (SqlConnection sourceConnection = new SqlConnection(connectionString))
{
    sourceConnection.Open();

    SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM [ContosoRetailDW].[dbo].[FactInventory]", sourceConnection);

    var dataTable = new System.Data.DataTable();

    adapter.FillSchema(dataTable, System.Data.SchemaType.Source);

    // Column 2 will hold the hash, stored as a hex string.
    dataTable.Columns[2].DataType = typeof(string);
    dataTable.Columns[2].MaxLength = 20;

    adapter.Fill(dataTable);

    // Hash column 2 of every row in memory.
    for (int i = 0; i < dataTable.Rows.Count; i++)
    {
        byte[] toHash = Encoding.UTF8.GetBytes((string)dataTable.Rows[i][2]);
        dataTable.Rows[i][2] = xxHash.CalculateHash(toHash).ToString("X");
    }

    using (SqlConnection destinationConnection = new SqlConnection(connectionString))
    {
        destinationConnection.Open();

        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(destinationConnection.ConnectionString))
        {
            bulkCopy.BatchSize = 100000;
            bulkCopy.DestinationTableName = "FactInventory_Hashed";
            bulkCopy.WriteToServer(dataTable);
        }
    }
}

I have already tried playing with the bulk copy batch size and I am using a very fast hashing algorithm.

David H.
  • See if creating a new table helps https://stackoverflow.com/questions/3711217/fastest-way-to-update-120-million-records – Mohit May 29 '17 at 12:29
  • Batch all of this into bundles of n rows and run the operations in parallel? – Alex K. May 29 '17 at 12:32 (a rough sketch of this appears after the comments)
  • Your current approach loads the entire source table into memory? How are you not running out of memory? You could try a streaming approach, like an SSIS DataFlow with the hashing done in a ScriptTransform. None of this will help if the bottleneck is DB/IO. – user6144226 May 29 '17 at 12:52
  • Yes, the idea was to load each table into memory, hash it and write it back. Peak memory usage was about 9 GB to 12 GB. This is because xxHash works at near-RAM speed, so I thought having the table in RAM would make the best use of xxHash's performance. I will profile my I/O, though, to see if that is the bottleneck. On my small dataset of 8 million records the SqlBulkCopy is fast, finishing in just over a minute. So batching up the hashing as well and running it in separate threads is something I will investigate too. Otherwise I'll put it inside our SSIS data prep package. – David H. May 29 '17 at 13:01
  • You could implement your own DataReader to hash the values on the fly and use that as the source to stream data via SqlBulkCopy. – Dan Guzman May 29 '17 at 13:03 (a sketch of this also appears after the comments)
  • I usually deal with data in the millions of rows. It seems you only need to replace Rows[i][2] with a hashed value. I recommend SELECTing only the required column: with millions of rows even the read alone can take a long time, and it is shorter with a single column. – Kay Lee May 29 '17 at 13:41
  • The critical point in dealing with bulky data is that you must avoid 'one by one' operations. Bulk copy is a quick shot, but inserting millions of rows one by one with a for or foreach loop is a long journey. A SQL-side function like this gives its best performance, unlike one-by-one processing forced from the client: http://mobile.databasejournal.com/features/mssql/getting-started-with-hashing-in-sql-server.html – Kay Lee May 29 '17 at 13:52
  • I already tried the HASHBYTES function with computed columns, but this was too slow. Also, I don't need a cryptographic hash function, so I am trying to speed up the process with xxHash, which is much faster than SHA-1. But if HASHBYTES is already highly optimized, I worry that I might not get very far with my own implementation. I will try reading only the necessary column, though, and I am trying the custom DataReader implementation as well. – David H. May 29 '17 at 14:04
  • Furthermore, it seems your speed-limiting step is the one-by-one for loop. Just change it to a foreach loop; foreach is the fastest. – Kay Lee May 29 '17 at 14:05
  • First, you would probably get a better response on http://DBA.stackexchange.com. Secondly, how long does a null Update take to execute on this table (`UPDATE foo SET hashCol = 0`). You aren't going to get any faster than that without changes on the DBMS side. – RBarryYoung May 29 '17 at 16:26
  • If on-server HASHBYTES was too slow, there is almost no chance that shipping the data to the client, munging it and then shipping it back to the server will be faster. – RBarryYoung May 29 '17 at 16:29
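
Following up on Alex K.'s comment, here is a minimal sketch of parallelizing just the hashing loop from the question. It assumes the same dataTable and xxHash.CalculateHash as above, that CalculateHash is safe to call from multiple threads, and a using System.Threading.Tasks; directive. Hashes are computed in parallel into a plain array and written back single-threaded, because a DataTable is only documented as safe for multithreaded reads, not writes.

// Sketch: compute the hashes on all cores, then write them back in one pass.
int rowCount = dataTable.Rows.Count;
string[] hashes = new string[rowCount];

Parallel.For(0, rowCount, i =>
{
    // Concurrent reads of the DataTable are fine as long as nothing writes to it yet.
    byte[] toHash = Encoding.UTF8.GetBytes((string)dataTable.Rows[i][2]);
    hashes[i] = xxHash.CalculateHash(toHash).ToString("X");
});

// Single-threaded write-back keeps the DataTable free of concurrent writes.
for (int i = 0; i < rowCount; i++)
{
    dataTable.Rows[i][2] = hashes[i];
}

This only attacks the 40-minute hashing step; the Fill and the SqlBulkCopy write are unchanged, so if the real bottleneck turns out to be DB/IO (as user6144226 suggests) the total time will not move much.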
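
And a sketch of Dan Guzman's suggestion: wrap the source reader in an IDataReader that hashes one column on the fly and hand that wrapper to SqlBulkCopy, so rows stream from source to destination without ever being materialized in a 9-12 GB DataTable. The class name, the hard-coded ordinal 2 and the assumption that the column's value can be converted to a string are illustrative, not taken from the actual schema.

// Pass-through IDataReader that replaces one column with its xxHash hex string.
// Requires: using System; using System.Data; using System.Data.SqlClient; using System.Text;
public sealed class HashingDataReader : IDataReader
{
    private readonly IDataReader _inner;
    private readonly int _hashOrdinal;   // ordinal of the column to hash

    public HashingDataReader(IDataReader inner, int hashOrdinal)
    {
        _inner = inner;
        _hashOrdinal = hashOrdinal;
    }

    // The only interesting member: swap the configured column for its hash.
    public object GetValue(int i)
    {
        object value = _inner.GetValue(i);
        if (i != _hashOrdinal || value == DBNull.Value)
            return value;
        byte[] toHash = Encoding.UTF8.GetBytes(value.ToString());
        return xxHash.CalculateHash(toHash).ToString("X");
    }

    public int GetValues(object[] values)
    {
        int count = _inner.GetValues(values);
        if (_hashOrdinal < count && values[_hashOrdinal] != DBNull.Value)
        {
            byte[] toHash = Encoding.UTF8.GetBytes(values[_hashOrdinal].ToString());
            values[_hashOrdinal] = xxHash.CalculateHash(toHash).ToString("X");
        }
        return count;
    }

    // Everything else is passed straight through to the wrapped reader.
    public int FieldCount => _inner.FieldCount;
    public bool Read() => _inner.Read();
    public bool NextResult() => _inner.NextResult();
    public void Close() => _inner.Close();
    public void Dispose() => _inner.Dispose();
    public bool IsClosed => _inner.IsClosed;
    public int Depth => _inner.Depth;
    public int RecordsAffected => _inner.RecordsAffected;
    public DataTable GetSchemaTable() => _inner.GetSchemaTable();
    public string GetName(int i) => _inner.GetName(i);
    public int GetOrdinal(string name) => _inner.GetOrdinal(name);
    public string GetDataTypeName(int i) => _inner.GetDataTypeName(i);
    public Type GetFieldType(int i) => i == _hashOrdinal ? typeof(string) : _inner.GetFieldType(i);
    public bool IsDBNull(int i) => _inner.IsDBNull(i);
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(_inner.GetOrdinal(name));
    public string GetString(int i) => (string)GetValue(i);
    public bool GetBoolean(int i) => _inner.GetBoolean(i);
    public byte GetByte(int i) => _inner.GetByte(i);
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => _inner.GetBytes(i, fieldOffset, buffer, bufferOffset, length);
    public char GetChar(int i) => _inner.GetChar(i);
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => _inner.GetChars(i, fieldOffset, buffer, bufferOffset, length);
    public IDataReader GetData(int i) => _inner.GetData(i);
    public DateTime GetDateTime(int i) => _inner.GetDateTime(i);
    public decimal GetDecimal(int i) => _inner.GetDecimal(i);
    public double GetDouble(int i) => _inner.GetDouble(i);
    public float GetFloat(int i) => _inner.GetFloat(i);
    public Guid GetGuid(int i) => _inner.GetGuid(i);
    public short GetInt16(int i) => _inner.GetInt16(i);
    public int GetInt32(int i) => _inner.GetInt32(i);
    public long GetInt64(int i) => _inner.GetInt64(i);
}

// Usage: stream from the SELECT straight into the bulk copy, no DataTable in between.
using (var sourceConnection = new SqlConnection(connectionString))
using (var destinationConnection = new SqlConnection(connectionString))
{
    sourceConnection.Open();
    destinationConnection.Open();

    using (var command = new SqlCommand("SELECT * FROM [ContosoRetailDW].[dbo].[FactInventory]", sourceConnection))
    using (var reader = new HashingDataReader(command.ExecuteReader(), 2))
    using (var bulkCopy = new SqlBulkCopy(destinationConnection))
    {
        bulkCopy.DestinationTableName = "FactInventory_Hashed";
        bulkCopy.BatchSize = 100000;
        bulkCopy.BulkCopyTimeout = 0;
        bulkCopy.WriteToServer(reader);
    }
}

This keeps reading, hashing and writing overlapped rather than sequential, which may help given that writing the records back currently takes a multiple of the hashing time; and if FactInventory_Hashed only needs a key plus the hash, narrowing the SELECT as Kay Lee suggests reduces the I/O further.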

0 Answers