
I am using .NET Core WebApi and EF Core.

I have the following scenario:

  • I have an endpoint which accepts a CSV file
  • This CSV file contains more than 15,000 lines
  • I need to parse this CSV file and store each line in my database
  • Before I store each line, I need to check whether the data in it already exists (4-5 values per line need to be checked to see whether they exist as independent entities)

My problem:

  1. Checking whether the data already exists takes a long time (15,000 lines times 4-5 checks...)

So I thought about creating some Tasks to store the data batchwise in the DB, but in .NET Core WebApi you cannot run background tasks, right?

How would you guys achieve this?

bbrinck
  • You could upload the data into a temporary table using the bulk copy program, then query the data into the appropriate tables. This is much faster for bulk operations. – Oliver Apr 14 '20 at 10:12
  • Do you see the file getting bigger as time goes on? What happens when this goes from 15K lines to 15M lines, or 15B lines? If that is the case, using an ETL process may be a better option. – Raj More Apr 14 '20 at 10:38
  • How big is the table against which you are making the checks? Could you not load it whole into memory? At least the 4-5 columns. – Martin Staufcik Apr 14 '20 at 10:47
  • @RajMore What do you mean by this exactly (ETL)? – bbrinck Apr 14 '20 at 10:52
  • @MartinStaufcik Yes, I did this and it works great; the problem is later with the insert. I tried to BulkInsert it (EFCore.BulkExtensions) but without success. I always get an error: InvalidOperationException: The given value of type String from the data source cannot be converted to type – bbrinck Apr 14 '20 at 10:54
  • [Here](https://stackoverflow.com/questions/18140012/sqlbulkcopy-the-given-value-of-type-string-from-the-data-source-cannot-be-conv) are some clues to the error. – Martin Staufcik Apr 14 '20 at 10:59
  • ETL tools are meant to work with bulk data. While 15K doesn't qualify, 15M might, and 15B is definitely a candidate. SSIS, Informatica, etc. are ETL tools, and need their own engines to run. – Raj More Apr 14 '20 at 13:59

1 Answer


One approach would be to upload the data into a temporary table using SqlBulkCopy, then query the data into the appropriate tables. For example:

  1. Create temporary table
await connection.ExecuteAsync(@"CREATE TABLE #TempTable
(
    [X] int NULL,
    [Y] nvarchar(100) NULL,
    [Z] datetime NULL
)", null, transaction);

  2. Create a DataTable and populate it with the CSV data
DataTable dataTable = new DataTable();

dataTable.Columns.Add("X", typeof(int));
dataTable.Columns.Add("Y", typeof(string));
dataTable.Columns.Add("Z", typeof(DateTime));

//foreach csv record...

var row = dataTable.NewRow();

row["X"] = 1;
row["Y"] = "test";
row["Z"] = DateTime.Now;

dataTable.Rows.Add(row);
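
A minimal sketch of that loop, assuming the endpoint receives an IFormFile parameter named file and the CSV is a simple comma-separated layout with a header row and no quoted fields (these names and assumptions are mine, not part of the answer above; a production import should use a proper CSV parser):

// Read the uploaded CSV line by line and add one DataRow per record.
using (var reader = new StreamReader(file.OpenReadStream()))
{
    bool isHeader = true;
    string line;

    while ((line = await reader.ReadLineAsync()) != null)
    {
        // Skip the header row.
        if (isHeader) { isHeader = false; continue; }

        var fields = line.Split(',');

        var row = dataTable.NewRow();
        row["X"] = int.Parse(fields[0]);
        row["Y"] = fields[1];
        row["Z"] = DateTime.Parse(fields[2]);

        dataTable.Rows.Add(row);
    }
}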

  3. Perform the bulk copy
using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.Default, transaction))
{
    bulkCopy.DestinationTableName = "#TempTable";
    bulkCopy.EnableStreaming = true;
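
    // Optional (an assumption, not part of the original snippet): map the DataTable
    // columns to the temp-table columns by name. Without mappings, SqlBulkCopy matches
    // columns by ordinal position, which is a common cause of errors like
    // "The given value of type String ... cannot be converted" mentioned in the comments.
    bulkCopy.ColumnMappings.Add("X", "X");
    bulkCopy.ColumnMappings.Add("Y", "Y");
    bulkCopy.ColumnMappings.Add("Z", "Z");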

    await bulkCopy.WriteToServerAsync(dataTable);
}
  4. Query the data from the temporary table into your destination tables. This is where you can perform your existing record checks, for example as sketched below.
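A minimal sketch of this step, assuming a destination table named Destination with the same three columns and treating X as the uniqueness key (both names are placeholders for illustration):

// Insert only the rows from the temp table that do not already exist in the destination.
await connection.ExecuteAsync(@"
    INSERT INTO Destination ([X], [Y], [Z])
    SELECT t.[X], t.[Y], t.[Z]
    FROM #TempTable t
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM Destination d
        WHERE d.[X] = t.[X]
    )", null, transaction);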
  5. Drop the temporary table
await connection.ExecuteAsync("DROP TABLE #TempTable", null, transaction);

I've found this approach much faster than EF for bulk importing data into a database, while maintaining some business logic.

Oliver