
I have a problem inserting a huge amount of data into SQL Server.

Previously I was using Entity Framework, but it was painfully slow for just 100K root-level records (plus two distinct child collections of roughly 200K records each, i.e. roughly 500K-600K records in memory). I applied all the usual optimizations (e.g. AutoDetectChangesEnabled = false, and recreating and disposing the context after each batch).
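
For reference, the batching pattern I used looked roughly like this (roots, MyDbContext, Root and the batch size are just illustrative placeholders):

const int batchSize = 1000; //illustrative
foreach (var batch in roots
    .Select((root, index) => new { root, index })
    .GroupBy(x => x.index / batchSize, x => x.root))
{
    using (var context = new MyDbContext()) //MyDbContext / Root are placeholders
    {
        context.Configuration.AutoDetectChangesEnabled = false; //avoid change-tracking overhead
        context.Set<Root>().AddRange(batch);
        context.SaveChanges();
    } //dispose and recreate the context after each batch so the tracked graph stays small
}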

I rejected that solution and used bulk insert (SqlBulkCopy) instead, which is substantially faster and more efficient; it was able to insert the 100K records in a minute or so.

But the main problem is getting the primary keys back for the newly inserted records. For this, I am thinking of writing a stored procedure that operates on a TVP (i.e. an in-memory data table holding all 100K root-level records) and uses OUTPUT INSERTED.Id so I can get all the primary keys back in the application.
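
Roughly what I have in mind (untested sketch; the table type, procedure and column names are just placeholders):

/* one-off T-SQL (placeholder names):
   CREATE TYPE dbo.PersonTvp AS TABLE (Name nvarchar(100) NOT NULL, DOB datetime NOT NULL);
   GO
   CREATE PROCEDURE dbo.InsertPeople @people dbo.PersonTvp READONLY
   AS
   INSERT INTO dbo.People (Name, DOB)
   OUTPUT INSERTED.Id
   SELECT Name, DOB FROM @people;
*/
public List<long> InsertViaTvp(SqlConnection con, DataTable people)
{
    var ids = new List<long>();
    using (var command = new SqlCommand("dbo.InsertPeople", con))
    {
        command.CommandType = CommandType.StoredProcedure;
        var p = command.Parameters.AddWithValue("@people", people);
        p.SqlDbType = SqlDbType.Structured; //table-valued parameter
        p.TypeName = "dbo.PersonTvp";
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                ids.Add(reader.GetInt64(0));
            }
        }
    }
    return ids;
}

One thing I'm not sure about is correlating the returned IDs back to the source rows, since OUTPUT on a plain INSERT cannot reference the source columns.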

So how does this approach (i.e. a SQL INSERT query inside a stored procedure) compare with the SqlBulkCopy approach?

Is there any way to get all the primary keys back after a SqlBulkCopy operation? Or anything concrete on whether OUTPUT INSERTED.Id would return all the correct new keys to the application?

PS: I don't want to create any staging table during the process; that is just overhead.

Usman
  • Possible duplicate of [Possible to get PrimayKey IDs back after a SQL BulkCopy?](http://stackoverflow.com/questions/2945414/possible-to-get-primaykey-ids-back-after-a-sql-bulkcopy) – JohnLBevan Dec 22 '16 at 15:38
  • @JohnLBevan: I don't want to create a staging table and then insert into the final table from there; that's the wrong approach. I am already passing a TVP (which holds exactly the same data and has the same structure as the target table). – Usman Dec 22 '16 at 17:00
  • @Usman *not* using a staging table is typically a very, very bad idea. Using EF or any ORM for that matter for ETL is even worse. ETL scenarios have no business functionality, just data transformations. – Panagiotis Kanavos Dec 22 '16 at 17:09
  • @Usman; TVP (table valued parameters) are different to stuff being in the database itself. Try the suggested answer on the above link and see if it benefits you; if not, please let us know. – JohnLBevan Dec 22 '16 at 17:10
  • The best option is to *not* get the IDs back to the client. Use a staging table, update/insert or MERGE statements to upsert data into the final tables without moving data back and forth between the client and server. In the time it takes to move 500K IDs across the network you could complete the entire process. SQL Server already has SSIS for this job. If you don't want to use it (why?), you should treat your data as a stream, not individual records. You should process it using a dataflow pipeline. The TPL Dataflow library can help there – Panagiotis Kanavos Dec 22 '16 at 17:12
  • It would be nice if your title and body of the question were asking the same thing. In your title, you appear to be wanting to understand the internal implementation differences between the two. In the body, you seem to have a specific need to get IDs back from SQL BulkCopy - which indeed is an exact duplicate of the question linked. So - which is it? – Matt Johnson-Pint Dec 22 '16 at 18:27
  • @Panagiotis Kanavos: Could you please give me an idea of how I can treat the data as a stream in this particular context using TPL Dataflow? That would be very nice; it's probably a new thing I would learn. – Usman Dec 22 '16 at 20:59
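
For reference, the staging table + MERGE upsert that Panagiotis Kanavos describes above could look roughly like this (the People table, its Name/DOB columns and the natural key used for matching are all assumptions):

//a rough sketch of the staging-table + MERGE upsert suggested above; table, column and key names are assumptions
private void UpsertPeopleFromStaging(SqlConnection con)
{
    //assumes #PeopleStaging has already been created and bulk-loaded via SqlBulkCopy on this connection
    const string mergeSql = @"
        MERGE dbo.People AS target
        USING #PeopleStaging AS source
            ON target.Name = source.Name            -- match on a natural key, not the identity
        WHEN MATCHED THEN
            UPDATE SET target.DOB = source.DOB
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (Name, DOB) VALUES (source.Name, source.DOB);";
    using (var command = new SqlCommand(mergeSql, con))
    {
        command.ExecuteNonQuery(); //nothing is returned to the client; the keys stay on the server
    }
}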

2 Answers


Here's an example based on discussion in the comments / expanding on the idea mentioned here: Possible to get PrimayKey IDs back after a SQL BulkCopy?

i.e.

  • Do a bulk upload from C# to a temp table in SQL
  • Use SQL to copy the data from the temp table to the actual table (at which point the IDs are generated), and return the IDs.

I've not had a chance to test this, but hopefully this will help:

//using System;
//using System.Collections.Generic;
//using System.Data;
//using System.Data.SqlClient;

public DataTable CreatePersonDataTable(IEnumerable<PersonDTO> people) 
{
    //define the table
    var table = new DataTable("People");
    table.Columns.Add(new DataColumn("Name", typeof(string)));
    table.Columns.Add(new DataColumn("DOB", typeof(DateTime)));
    //populate it
    foreach (var person in people)
    {
        table.Rows.Add(person.Name, person.DOB);
    }
    return table;
}

readonly string ConnectionString; //set this in the constructor
readonly int BulkUploadPeopleTimeoutSeconds = 600; //default; could override in constructor
public IEnumerable<long> BulkUploadPeople(IEnumerable<PersonDTO> people) //you'd want to break this up a bit; for simplicity I've bunged everything into one big method
{
    var data = CreatePersonDataTable(people);
    using(SqlConnection con = new SqlConnection(ConnectionString)) 
    {
        con.Open(); //keep same connection open throughout the session; the temp table is session-scoped
        RunSqlNonQuery(con, "select top 0 Name, DOB into #People from People"); //create an empty temp table with matching column types
        BulkUpload(con, data, "#People");
        var results = TransferFromTempToReal(con, "#People", "People", "Name, DOB", "Id");
        RunSqlNonQuery(con, "drop table #People");  //not strictly required since this would be removed when the connection's closed as it's session scoped; but best to keep things clean.
        return results; //return inside the using block; the IDs have already been read into a list by this point
    }
}
private void RunSqlNonQuery(SqlConnection con, string sql)
{
    using (SqlCommand command = con.CreateCommand())
    {
        command.CommandText = sql;
        command.ExecuteNonQuery();      
    }
}
private void BulkUpload(SqlConnection con, DataTable data, string targetTable)
{
    using(SqlBulkCopy bulkCopy = new SqlBulkCopy(con))
    {
        bulkCopy.BulkCopyTimeout = BulkUploadPeopleTimeoutSeconds; //timeout in seconds; define this in your config
        bulkCopy.DestinationTableName = targetTable; 
        bulkCopy.WriteToServer(data);         
    }
}
private IEnumerable<long> TransferFromTempToReal(SqlConnection con, string tempTable, string realTable, string columnNames, string idColumnName)
{
    var ids = new List<long>(); //materialise the results so the reader is closed before the temp table is dropped
    using (SqlCommand command = con.CreateCommand())
    {
        //explicit column list on the insert so the identity column on the real table isn't part of it
        command.CommandText = string.Format("insert into {0} ({2}) output inserted.{1} select {2} from {3}", realTable, idColumnName, columnNames, tempTable);
        using (SqlDataReader reader = command.ExecuteReader()) 
        {
            while(reader.Read()) 
            {
                ids.Add(reader.GetInt64(0)); //assumes the id column is a bigint; use GetInt32 for an int column
            }
        }
    }
    return ids;
}

Whilst in your question you've said that you don't want to use a staging table as it's an "overhead"... please try it. You may find that the small overhead of creating a staging table is outweighed by the performance gain of this method.

Obviously it's not going to be as fast as inserting and ignoring the returned IDs; but if returning the IDs is a requirement then, in the absence of other answers, this may be the best alternative.

JohnLBevan

Is there any way to get all the primary keys back after a SqlBulkCopy operation?

You cannot. There is no way to get the generated keys back directly from SqlBulkCopy.

PS: I don't want to create any staging table during the process; that is just overhead.

Unfortunately, if you want to get your primary keys back, you will need to use a staging table or another approach (such as the TVP you suggested).

Disclaimer: I'm the owner of Entity Framework Extensions

One alternative solution is to use a library that already supports BulkInsert for Entity Framework. Under the hood, it uses SqlBulkCopy + staging tables.

By default, the BulkInsert method already outputs the primary key values.
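
For example (assuming a Customer entity whose Id is a database-generated key; after the call, the generated values are available on the entities):

context.BulkInsert(customers);

// primary key values generated by the database are now populated on the entities
var newIds = customers.Select(c => c.Id).ToList();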

The library is not free; however, it adds some flexibility for your company and you will not have to code or support anything yourself.

Example:

// Easy to use
context.BulkSaveChanges();

// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);

// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);

// Customize Primary Key
context.BulkMerge(customers, operation => {
   operation.ColumnPrimaryKeyExpression = 
        customer => customer.Code;
});
Jonathan Magnan
  • lol ;) We have a lot of open-source libraries that we are happy to support for free but I also need to pay my bill. So, unfortunately, we need to also have a few paid ones. – Jonathan Magnan Dec 17 '21 at 14:14