I need to be able to read approximately 100 million records from a SQL Server 2016 database table, generate hashes for one or more columns, and write these records back to the table.
So far I have tried a couple of solutions that are too slow for our needs. I am testing this on a Dell XPS 15 with an i7-7700HQ processor and 32 GB of RAM. First I tried the T-SQL HASHBYTES() function with a SHA1 hash, but this took more than a couple of hours on a 100-million-record test dataset.
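For reference, that first attempt was essentially a single UPDATE using HASHBYTES, roughly along these lines (shown here wrapped in C# for consistency with the rest of the code; the column name is just a placeholder for the column being hashed):

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    @"UPDATE [ContosoRetailDW].[dbo].[FactInventory]
      SET [SomeColumn] = CONVERT(char(40), HASHBYTES('SHA1', [SomeColumn]), 2);",
    connection))
{
    // [SomeColumn] is a placeholder; CommandTimeout = 0 because the statement runs for hours.
    connection.Open();
    command.CommandTimeout = 0;
    command.ExecuteNonQuery();
}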
Updating via a C# OleDbDataReader was even slower, of course, but the bottleneck seemed to be in writing the records. Now I am at the point where I am using SqlBulkCopy to copy the modified records into a new temp table, which is much faster than updating an existing table. However, it still takes me 40 minutes to generate all the hashes, and writing the records back takes a multiple of that. Ideally we would want the entire operation to finish within an hour. Does anyone have an idea where I could optimize further? Here is the code:
// Requires: using System; using System.Data; using System.Data.SqlClient; using System.Text;
// xxHash.CalculateHash is assumed to come from a third-party xxHash library.
using (SqlConnection sourceConnection = new SqlConnection(connectionString))
{
    sourceConnection.Open();

    // Pull the whole table into memory so the hash column can be rewritten client-side.
    SqlDataAdapter adapter = new SqlDataAdapter(
        "SELECT * FROM [ContosoRetailDW].[dbo].[FactInventory]", sourceConnection);
    var dataTable = new System.Data.DataTable();
    adapter.FillSchema(dataTable, System.Data.SchemaType.Source);

    // Column 2 will hold the hex string produced by the hash, so switch it to a string column.
    dataTable.Columns[2].DataType = typeof(string);
    dataTable.Columns[2].MaxLength = 20;
    adapter.Fill(dataTable);

    // Hash the value in column 2 of every row and write the hex representation back into the row.
    for (int i = 0; i < dataTable.Rows.Count; i++)
    {
        byte[] toHash = Encoding.UTF8.GetBytes((string)dataTable.Rows[i][2]);
        dataTable.Rows[i][2] = xxHash.CalculateHash(toHash).ToString("X");
    }

    using (SqlConnection destinationConnection = new SqlConnection(connectionString))
    {
        destinationConnection.Open();

        // Bulk copy the hashed rows into a new table instead of updating the existing one.
        // Note: constructing SqlBulkCopy from the connection string makes it open its own connection.
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(destinationConnection.ConnectionString))
        {
            bulkCopy.BatchSize = 100000;
            bulkCopy.DestinationTableName = "FactInventory_Hashed";
            bulkCopy.WriteToServer(dataTable);
        }
    }
}
I have already tried playing with the bulk copy batch size, and I am using a very fast hashing algorithm (xxHash).
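For completeness, the bulk copy tuning amounts to setting properties like the ones below. Only BatchSize is something I have actually varied; the TableLock option and the timeout are shown only to illustrate the knobs SqlBulkCopy exposes, and the snippet reuses the connection string and DataTable from the code above.

// Sketch of the SqlBulkCopy settings involved; values here are just examples.
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(
    destinationConnection.ConnectionString,
    SqlBulkCopyOptions.TableLock))          // take a bulk-update lock on the target table
{
    bulkCopy.BatchSize = 100000;            // rows per batch; this is the value I have been tuning
    bulkCopy.BulkCopyTimeout = 0;           // no timeout for the long-running copy
    bulkCopy.DestinationTableName = "FactInventory_Hashed";
    bulkCopy.WriteToServer(dataTable);
}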