I have a requirement to process a large 4.5GB CSV file of flight data with 6 million rows.

I have a C# console app that reads each row and saves the flight and flight leg data to various tables in our SQL Server database.

Currently I'm taking the basic line-by-line approach, calling a stored procedure to insert the relevant data into the relevant table.

string inputLine;
while ((inputLine = csvReader.ReadLine()) != null)
{
    // process the data line by line and save it to the database via stored procedures
    var fields = inputLine.Split(',');

    var flightModel = new FlightModel { FlightNumber = fields[0] };
    var newFlightId = await executeStoredProc(flightModel);

    var flightLegModel = new LegModel { FlightId = newFlightId, Time = fields[1] };
    await executeStoredProc(flightLegModel);
}

However, this is taking about an hour per million rows, and the job needs to run once a night, ideally completing within one or two hours at most.

Splitting the file and processing it concurrently on another server would be expensive, so ideally I'm looking for a quicker way to speed the above process up on a single machine.

mb1231
  • Use `SqlBulkCopy` to load the data into your staging tables. You have not shown the SQL procedure that processes your data, so I'm unable to offer any comment on it. – Stu Jul 27 '21 at 09:35
  • The stored procedure just calls a basic `INSERT INTO [table]` statement. The processing happens in code (just me indexing the CSV row and assigning it to a data model in C#). – mb1231 Jul 27 '21 at 09:37
  • Are you doing this all on one thread, or are you parallelising this? If so, could we see your implementation? Can we see how you're manipulating the data before the queries? – DekuDesu Jul 27 '21 at 09:38
  • Are you saying you're reading the data in code, doing required processing and calling a proc to insert per row? – Stu Jul 27 '21 at 09:39
  • @DekuDesu - all on one thread, but it can be multi-threaded I assume, given each row can be processed separately? I'll update the question to show the processing part; it's basically just me indexing the CSV row and assigning data to a basic C# model. – mb1231 Jul 27 '21 at 09:40
  • @Stu - yes, I am processing in code; the stored proc is called afterwards using the params in the C# model. – mb1231 Jul 27 '21 at 09:41
  • If you want to process your data as fast as possible, line by line is exactly how not to do that. I process 1M+ rows from a CSV file that's 3.5GB in size, using `SqlBulkCopy` from a C# ETL application, into a staging load table, then clean the data and insert/update as necessary in the database; it takes about 9 minutes. It's run several times per day. – Stu Jul 27 '21 at 09:47
  • @Stu thanks. So should my staging load table just be a table with all the properties of the entire CSV row, and then I use `SqlBulkCopy` to copy all CSV rows into it? And then use a stored procedure to go through each row and copy it into the relevant database tables? – mb1231 Jul 27 '21 at 09:58
  • Echoing the recommendation to use `SqlBulkCopy`, I would expect loading the raw data into a staging table and then calling a proc for the ETL to take seconds, and to be worth the refactoring effort. Note that the proc should perform set-based queries, not RBAR. – Dan Guzman Jul 27 '21 at 09:58
  • Have a look at my answers to https://stackoverflow.com/questions/48905709/how-to-perform-bulk-upload-using-entity-framework-on-non-relational-tables/48906017#48906017 and https://stackoverflow.com/questions/42156362/console-application-slowing-down-over-time/42156942#42156942 showing SQLBulkCopy. – Steve Ford Jul 27 '21 at 22:44
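
Putting the suggestions from the comments together, here is a minimal sketch of the bulk-load approach. It assumes `System.Data.SqlClient`, a two-column staging table `dbo.FlightStaging`, a set-based stored procedure `dbo.ProcessFlightStaging`, a simple comma split, and a 50,000-row batch size; all of these names and values are placeholders to adapt to the real schema and the full set of CSV columns.

using System.Data;
using System.Data.SqlClient;
using System.IO;

class StagingBulkLoader
{
    const int BatchSize = 50_000;

    static void Main()
    {
        var connectionString = "...";           // your SQL Server connection string
        var csvPath = @"C:\data\flights.csv";   // the nightly 4.5GB file

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // In-memory buffer holding one batch of staging rows.
        var table = new DataTable();
        table.Columns.Add("FlightNumber", typeof(string));
        table.Columns.Add("Time", typeof(string));

        using var bulkCopy = new SqlBulkCopy(connection)
        {
            DestinationTableName = "dbo.FlightStaging", // hypothetical staging table
            BatchSize = BatchSize,
            BulkCopyTimeout = 0                         // no timeout for the nightly load
        };
        bulkCopy.ColumnMappings.Add("FlightNumber", "FlightNumber");
        bulkCopy.ColumnMappings.Add("Time", "Time");

        using var csvReader = new StreamReader(csvPath);
        string inputLine;
        while ((inputLine = csvReader.ReadLine()) != null)
        {
            var fields = inputLine.Split(',');
            table.Rows.Add(fields[0], fields[1]);

            // Flush each batch to the staging table to keep memory bounded.
            if (table.Rows.Count == BatchSize)
            {
                bulkCopy.WriteToServer(table);
                table.Clear();
            }
        }
        if (table.Rows.Count > 0)
            bulkCopy.WriteToServer(table);

        // Hypothetical proc that moves the staged rows into the flight and
        // flight leg tables with set-based INSERT ... SELECT statements,
        // rather than processing them row by row.
        using var command = new SqlCommand("dbo.ProcessFlightStaging", connection)
        {
            CommandType = CommandType.StoredProcedure,
            CommandTimeout = 0
        };
        command.ExecuteNonQuery();
    }
}

The batch size is only a starting point; loading into a heap staging table (optionally with `SqlBulkCopyOptions.TableLock`) and keeping the proc strictly set-based is what typically brings this kind of load down from hours to minutes, in line with the comments above.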
