I have an ASP.NET MVC application that I'm trying to write an import function for.
I do have some specifics (for example, I am using Entity Framework v4 in an MVC application), but I am mostly interested in an algorithm that would work best, preferably with an explanation of what kind of performance it has and why.
This operation is going to be performed asynchronously, so execution time is not as much of a factor as something like RAM use.
I should point out that there are several things (the database being the main one) that I have been forced to inherit and, due to time constraints, will not be able to clean up until a later date.
Details
The import function is to take an in-memory CSV file (which has been exported from Salesforce and uploaded) and merge it into an existing database table. The process needs to be prepared to do the following (a rough sketch of the overall merge is included after this list):
Update existing records which may have been altered in the CSV, without deleting and re-adding the database record, so as to preserve the primary key of each record.
Add and remove records as they appear in or disappear from the CSV file.
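To be concrete, this is the overall shape of the merge I have in mind, assuming for the moment that something like ContactID can act as the key. The class, method, and delegate names here are purely illustrative, not an actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class CsvMergeSketch
{
    // Illustrative only: merge the uploaded CSV rows into the existing Contacts
    // rows, keyed on ContactID. "existing" maps ContactID -> the DataRow already
    // in the database; "incoming" is the DataTable parsed from the CSV. The
    // insert/update/delete delegates stand in for whatever persistence code is used.
    public static void Merge(IDictionary<string, DataRow> existing,
                             DataTable incoming,
                             Action<DataRow> insert,
                             Action<DataRow, DataRow> update,
                             Action<DataRow> delete)
    {
        var seenKeys = new HashSet<string>();

        foreach (DataRow csvRow in incoming.Rows)
        {
            string key = Convert.ToString(csvRow["ContactID"]);
            if (!seenKeys.Add(key))
                continue; // duplicate ContactID within the CSV - skip it

            DataRow dbRow;
            if (existing.TryGetValue(key, out dbRow))
                update(dbRow, csvRow); // record exists: update in place, keep its PK
            else
                insert(csvRow);        // record is new to the database: add it
        }

        // Anything in the database that no longer appears in the CSV gets removed.
        foreach (var stale in existing.Where(pair => !seenKeys.Contains(pair.Key)))
            delete(stale.Value);
    }
}
```

The dictionary lookup keeps each per-row decision O(1), so the whole pass is roughly linear in the number of CSV rows plus existing records.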
The current structure of the CSV and the database table is as follows:
The table and the CSV both contain 52 columns.
Each column in the existing database schema is a VARCHAR(100) field; I am planning to optimise this, but cannot within the current time-frame.
The database back-end is MS SQL Server.
The CSV file has about 1,700 rows' worth of data in it. I can't see this number exceeding 5,000, as there are apparently already many duplicate entries.
Right now I am only planning to import 10 of those columns from the CSV; the rest of the table's fields will be left null, and I will remove the unneeded columns at a later date.
The CSV file is being read into a DataTable to make it easier to work with (I've included a rough sketch of the parsing below).
I initially thought that the ContactID field in my Salesforce CSV was a unique identifier, but after doing some test imports it seems that there are no unique fields in the CSV file itself, at least none that I can find.
Given that, I have been forced to add a primary key field to the Contacts table so that other tables can still maintain a valid relationship with any given contact. However, this obviously prevents me from simply deleting and re-creating the records on each import.
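For reference, this is roughly how the uploaded file ends up in a DataTable. The exact parsing code isn't important; this sketch just uses TextFieldParser as an example, and keeps every value as a string to match the VARCHAR(100) schema:

```csharp
using System.Data;
using System.IO;
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic.dll

public static class CsvParsingSketch
{
    // Rough sketch: parse the uploaded CSV stream into a DataTable, using the
    // first line as the column names. Every value is kept as a string, which
    // matches the VARCHAR(100) columns in the existing table.
    public static DataTable ToDataTable(Stream csvStream)
    {
        var table = new DataTable();

        using (var parser = new TextFieldParser(csvStream))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // The header row defines the columns.
            foreach (string header in parser.ReadFields())
                table.Columns.Add(header, typeof(string));

            // Each remaining row becomes a DataRow.
            while (!parser.EndOfData)
                table.Rows.Add(parser.ReadFields());
        }

        return table;
    }
}
```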
BEGIN EDIT
It's clear to me that what I was trying to achieve (updating existing database records when no relationship exists between the table and the CSV) simply cannot be done.
It wasn't so much that I didn't know this beforehand, but more that I was hoping there was just some bright idea I hadn't thought of that could do this.
With that in mind, I ended up deciding to just assume in my algorithm that ContactID is a unique identifier, and then see how many duplicates I ended up with.
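Counting the duplicates amounts to something like the following. This is a rough sketch rather than my exact code, and it assumes the key column in the DataTable is named ContactID:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DuplicateCountSketch
{
    // Rough sketch: count how many CSV rows share a ContactID with an earlier row.
    public static int CountDuplicates(DataTable csv, string keyColumn)
    {
        var seen = new HashSet<string>();
        int duplicates = 0;

        foreach (DataRow row in csv.Rows)
        {
            if (!seen.Add(Convert.ToString(row[keyColumn])))
                duplicates++; // this ContactID has already appeared
        }

        return duplicates;
    }
}
```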
I'm posting a possible solution as an answer below, covering both the algorithm and an actual implementation. I'll leave the question open for a few more days because I'd much prefer to accept someone else's better solution as the answer.
Here are some things I found after implementing my solution below:
- I had to narrow the rows provided by the CSV so that they matched the rows being imported into the database.
- The SqlDataReader is perfectly fine; what has the biggest impact is the individual UPDATE/INSERT queries that are performed (a simplified sketch of that per-row work is included after this list).
- For a completely fresh import, the initial read of items into memory is not noticeable in the UI; the insert process takes about 30 seconds to complete.
- There were only 15 duplicate IDs skipped on a fresh import, which is less than 1% of the total data set. I have deemed this an acceptable loss, as I am told the Salesforce database is going to have a clean-up anyway, and I am hoping the IDs can be regenerated in those cases.
- I have not collected any resource metrics during the import, but in terms of speed this is acceptable, thanks to the progress bar I've implemented to provide feedback to the user.
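For context, the per-row work I'm describing above looks roughly like this. It's a simplified sketch: FirstName and LastName are placeholder column names, and only a couple of the ten imported columns are shown:

```csharp
using System.Data;
using System.Data.SqlClient;

public static class ContactUpsertSketch
{
    // Simplified sketch of the per-row UPDATE/INSERT described above. Only a
    // couple of the ten imported columns are shown; FirstName and LastName are
    // placeholder column names, not my real schema.
    public static void UpsertRow(SqlConnection connection, DataRow csvRow)
    {
        const string updateSql =
            @"UPDATE Contacts
                 SET FirstName = @FirstName, LastName = @LastName
               WHERE ContactID = @ContactID";

        const string insertSql =
            @"INSERT INTO Contacts (ContactID, FirstName, LastName)
              VALUES (@ContactID, @FirstName, @LastName)";

        using (var update = new SqlCommand(updateSql, connection))
        {
            AddParameters(update, csvRow);

            // If nothing was updated, the contact doesn't exist yet, so insert it.
            if (update.ExecuteNonQuery() == 0)
            {
                using (var insert = new SqlCommand(insertSql, connection))
                {
                    AddParameters(insert, csvRow);
                    insert.ExecuteNonQuery();
                }
            }
        }
    }

    private static void AddParameters(SqlCommand command, DataRow csvRow)
    {
        command.Parameters.AddWithValue("@ContactID", csvRow["ContactID"]);
        command.Parameters.AddWithValue("@FirstName", csvRow["FirstName"]);
        command.Parameters.AddWithValue("@LastName", csvRow["LastName"]);
    }
}
```

If the 30 seconds ever becomes a problem, batching these statements or wrapping the whole loop in a single transaction is probably where I'd look first.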
END EDIT
Resources
Given the allocation size of each field, even with this relatively small number of records, I am concerned mostly about the amount of memory that might be allocated during the import.
The application will not be run in a shared environment, so there is room to breathe in that respect. Also, this particular function would only be run once a week or so, manually.
My aim is to at least be able to run comfortably on a semi-dedicated machine. Machine specs are variable as the application may eventually be sold as a product (though again, not targeted to a shared environment).
In terms of run-time for the import process itself, as mentioned, this is going to be asynchronous, and I have already put together some AJAX calls and a progress bar, so I would imagine that anywhere up to a minute or two would be OK.
Solution
I did find the following post which seems to be close to what I want:
Compare two DataTables to determine rows in one but not the other
It seems to me that performing lookups against a hashtable is the right idea. However, as mentioned, if I can avoid loading both the CSV and the Contacts table entirely into memory, that would be preferable, and I can't see how to avoid it with the hashtable method.
One thing I am not sure how to achieve is calculating a hash of each row for comparison when one set of data is a DataTable object and the other is an EntitySet of Contact items.
I am thinking that unless I want to manually iterate over each column value to calculate the hash, I will need both data sets to be the same object type, unless anyone has a cleverer solution (the closest I've come up with is sketched below).
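The closest I've come up with is building a composite string key from the ten imported columns on both sides, which at least avoids needing the two data sets to be the same type. The Contact class and the column/property names below are placeholders for my actual schema:

```csharp
using System;
using System.Data;
using System.Linq;

// Minimal stand-in for the EF entity; the real Contact has 52 columns.
public class Contact
{
    public string ContactID { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }
}

public static class RowKeySketch
{
    // Placeholder names for the ten CSV columns actually being imported.
    private static readonly string[] ImportedColumns =
        { "ContactID", "FirstName", "LastName", "Email" /* ...and so on */ };

    // Build a single comparable string from a CSV DataRow.
    public static string FromDataRow(DataRow row)
    {
        return string.Join("|",
            ImportedColumns.Select(column => Convert.ToString(row[column]).Trim()));
    }

    // Build the same kind of key from an EF Contact entity. The property names
    // are placeholders for the real entity fields, in the same column order.
    public static string FromContact(Contact contact)
    {
        var values = new[]
        {
            contact.ContactID, contact.FirstName, contact.LastName, contact.Email
        };
        return string.Join("|", values.Select(value => (value ?? "").Trim()));
    }
}
```

With both sides reduced to plain strings like that, the HashSet comparison from the linked question becomes straightforward, although it still means holding both data sets in memory, which is exactly what I was hoping to avoid.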
Am I best to simply forget Entity Framework for this procedure? I've certainly spent a lot of time trying to get it to even remotely perform operations in bulk, so I'm more than happy to remove it from the equation.
If anything doesn't make sense or is missing, I apologise; I'm very tired. Just let me know and I'll fix it tomorrow.
I appreciate any help that can be offered, as I'm beginning to get desperate. I've spent more time agonising over how to approach this than I had planned.
Thanks!