I am trying to insert a large number of rows (>10,000,000) into a MySQL database using EF ObjectContext (db-first). After reading the answer of this question I wrote this code (batch save) to insert about 10,000 contacts (30k rows actually; including related other rows):
// var myContactGroupId = ...;
const int maxContactToAddInOneBatch = 100;
var numberOfContactsAdded = 0;
// IEnumerable<ContactDTO> contacts = ...
foreach (var contact in contacts)
{
var newContact = AddSingleContact(contact); // method excerpt below
if (newContact == null)
{
return;
}
if (++numberOfContactsAdded % maxContactToAddInOneBatch == 0)
{
LogAction(Action.ContactCreated, "Batch #" + numberOfContactsAdded / maxContactToAddInOneBatch);
_context.SaveChanges();
_context.Dispose();
// _context = new ...
}
}
// ...
private Contact AddSingleContact(ContactDTO contact)
{
Validate(contact); // Simple input validations
// ...
// ...
var newContact = Contact.New(contact); // Creates a Contact entity
// Add cell numbers
foreach (var cellNumber in contact.CellNumbers)
{
var existingContactCell = _context.ContactCells.FirstOrDefault(c => c.CellNo == cellNumber);
if (existingContactCell != null)
{
// Set some error message and return
return;
}
newContact.ContactCells.Add(new ContactCell
{
CellNo = cellNumber,
});
}
_context.Contacts.Add(newContact);
_context.ContactsInGroups.Add(new ContactsInGroup
{
Contact = newContact,
// GroupId = some group id
});
return newContact;
}
But it seems that the more contacts are added (batchwise), it takes more time (non linear). Here is the log for batch size 100 (10k contacts). Notice the increasing time needed as the batch# increases:
12:16:48 Batch #1
12:16:49 Batch #2
12:16:49 Batch #3
12:16:50 Batch #4
12:16:50 Batch #5
12:16:50 Batch #6
12:16:51 Batch #7
12:16:52 Batch #8
12:16:53 Batch #9
12:16:54 Batch #10
...
...
12:21:26 Batch #89
12:21:32 Batch #90
12:21:38 Batch #91
12:21:44 Batch #92
12:21:50 Batch #93
12:21:57 Batch #94
12:22:03 Batch #95
12:22:10 Batch #96
12:22:16 Batch #97
12:22:23 Batch #98
12:22:29 Batch #99
12:22:36 Batch #100
It took 6 mins 48 sec. If I increase the batch size to 10,000 (requires a single batch), it takes about 26 sec (for 10k contacts). But when I try to insert 100k contacts (10k per batch), it takes a long time (for the increasing time per batch I guess).
Can you explain why it is taking increasing amount of time despite of the context being renew-ed? Is there any other idea except raw SQL?