
I have some problems with two massive related tables. The first one has about 100 columns, the second about 300. The foreign key spans 5 columns. 100 million rows is nothing unusual in these tables.

My task is to download all the rows, process them and then upload them back. I need to download these tables in packs (10,000 rows from the parent table plus all the rows related to them from the second table).

What would be the fastest way to do it?

The simplest solution would be to download 10,000 rows from the parent table and then iterate through them to download the related rows. Simple, but I don't think it would be fast.

Another solution could be to download with a join of the two tables. The problem is that I then have to split each row into two parts, eliminate duplicates, etc. I also don't really know how fast this download would be.
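For illustration, the join approach I have in mind would look roughly like this (just a sketch: dbo.Parent, dbo.Child, the key column names, connectionString and lastKey1 are placeholders, and the paging on the leading key column is simplified):

    // Rough sketch of one "pack": 10,000 parent rows plus all their related child
    // rows, fetched in a single round trip (requires System.Data.SqlClient).
    const string sql = @"
        ;WITH batch AS (
            SELECT TOP (10000) *
            FROM dbo.Parent
            WHERE Key1 > @lastKey1               -- simplified keyset paging
            ORDER BY Key1, Key2, Key3, Key4, Key5
        )
        SELECT b.*, c.*
        FROM batch b
        JOIN dbo.Child c
          ON c.Key1 = b.Key1 AND c.Key2 = b.Key2 AND c.Key3 = b.Key3
         AND c.Key4 = b.Key4 AND c.Key5 = b.Key5;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@lastKey1", lastKey1);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // split the flat row back into its parent and child parts here
            }
        }
    }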

So, my question is the same as in the title: what is the fastest way to download a massive amount of data from related tables in packs?

asked by soshman

2 Answers


I think the best solution here is to first download all the rows you require, so all 1 million rows, and put them into a List&lt;T&gt; where T is the class mapped to the table. This is easily done using a framework like NHibernate, which lets you map the database structure to classes.

Once you have that, you can proceed with something like this. You have a number of batches, let's say 10,000 rows per batch:

    int totalCount = LIST.Count;
    int batchSize = 10000;
    int numberOfBatches = (int)Math.Ceiling((decimal)totalCount / batchSize);

    for (int i = 0; i < numberOfBatches; i++)
    {
        // take the next 10,000 rows from the list
        var currentBatch = LIST.Skip(i * batchSize).Take(batchSize);

        // process/update the current batch here
    }

So basically you will be updating x rows at a time in the database. I highly suggest you use NHibernate as opposed to SQLReader/Writers, as it is much more efficient and tasks like insert/update/delete become trivial.

EDIT: As an alternative to NHibernate's update you can use a bulk update. Have a look at Bulk Update in C#.
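For example, a minimal sketch of such a bulk update (assuming SQL Server; the #Staging table, dbo.Parent and the column names are placeholders, not the schema from the question):

    using System.Data;
    using System.Data.SqlClient;

    // Sketch: bulk-copy the changed rows into a staging table, then apply them
    // with one set-based UPDATE instead of row-by-row statements.
    static void BulkUpdate(DataTable changedRows, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // 1. Create a session-scoped staging table and fill it with SqlBulkCopy.
            new SqlCommand(
                "CREATE TABLE #Staging (Key1 INT, Key2 INT, Value NVARCHAR(100));",
                conn).ExecuteNonQuery();

            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Staging" })
            {
                bulk.WriteToServer(changedRows);
            }

            // 2. Apply all changes in a single statement on the server.
            new SqlCommand(@"
                UPDATE p SET p.Value = s.Value
                FROM dbo.Parent p
                JOIN #Staging s ON s.Key1 = p.Key1 AND s.Key2 = p.Key2;",
                conn).ExecuteNonQuery();
        }
    }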

answered by Martino
  • Sorry, but I cannot download all the required rows. It's not 1 million rows, it can be 100 million and more. It just cannot be done. – soshman Aug 08 '14 at 09:30
  • Then I would set my upper limit to, let's say, 1 million. You download 1 million, do 100 iterations of the code I provided (with 10k batches), then update, and then download the next 1 million. Depending on how many fields your table has, the bitness of your operating system and the available memory, you might want to adjust the number of rows you download accordingly. – Martino Aug 08 '14 at 09:33
  • But how should I download that 1 million? What is the fastest way? Even 1 million rows is quite a big amount of data considering the number of columns. – soshman Aug 08 '14 at 09:34
  • I would use IDataReader or LINQ to read rows as they arrive. You can easily leave a query running while a collection grows with the results, processing x rows at a time with .Take(result number) as you go. – Martino Aug 08 '14 at 09:39
  • I do not like that approach. It may cause a lot of different problems. Thanks for your help. I'll wait to see if there are any other solutions. ;) – soshman Aug 08 '14 at 09:43

The fastest way would be to use an ETL tool like SSIS to process the data on the server without transferring it to other machines.

SSIS allows batching, per-row processing of data streams with many built-in operations or even C# scripts, execution monitoring, handling of dirty data etc.

In ETL scenarios IO is the big killer, so transferring the data to other machines should be avoided. Connection latency is another killer, so retrieving the child records from a client machine is also going to kill performance.

A proper SQL statement like a join between the two tables will perform orders of magnitude better than pulling the data to some other machine and then pushing it back. Moreover, the database can optimize large selects and updates because it can choose the proper query and update strategies and touches only the data that is actually used in the selects or updates.
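As a rough illustration of that point (table names, key columns and the Processed flag below are placeholders, not the actual schema), a single set-based statement issued from the client can process the rows in batches entirely inside SQL Server:

    // Placeholder sketch: process rows in 10,000-row batches on the server,
    // repeating until nothing is left (requires System.Data.SqlClient).
    const string batchUpdate = @"
        UPDATE TOP (10000) c
        SET    c.SomeColumn = p.SomeColumn,      -- whatever the real processing is
               c.Processed  = 1
        FROM   dbo.Child c
        JOIN   dbo.Parent p
          ON   p.Key1 = c.Key1 AND p.Key2 = c.Key2 AND p.Key3 = c.Key3
         AND   p.Key4 = c.Key4 AND p.Key5 = c.Key5
        WHERE  c.Processed = 0;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(batchUpdate, conn))
    {
        conn.Open();
        int affected;
        do
        {
            affected = cmd.ExecuteNonQuery();    // rows updated in this batch
        } while (affected > 0);
    }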

Finally, ORMs like NHibernate, EF or LINQ to SQL should be avoided at all costs in ETL scenarios with even modest data sizes. The CPU and memory overhead of mapping thousands (let alone millions) of objects is significant without providing any benefit.

answered by Panagiotis Kanavos