Advice on ways to merge CSV records into an SQL Table in ASP .NET

Question

I have an ASP .NET MVC application, which I'm trying to write an import function for.

I do have some specifics, for example I am using Entity Framework v4 in an MVC application, but I am particularly concerned in an algorithm that would work the best, preferably with an explanation of what kind of performance it has, and why.

This operation is going to be performed asynchronously, so execution time is not as much of a factor as something like RAM use.

I should point out that there are several things (the database being the main one) that I have been forced to inherit and due to time constraints, will not be able to clean up till a later date.

Details

The import function is to take an in-memory CSV file (which has been exported from Sales Force and uploaded) and merge it into an existing database table. The process needs to be prepared to:

Update existing records which may have been altered in the CSV, without deleting an re-adding the database record so as to preserve the primary key of each record.
Add and remove any records as they change in the CSV file.

The current structure of the CSV and Database table are such that:

The Table and CSV both contain 52 columns.
Each column in the existing database schema is a VARCHAR(100) field; I am planning to optimise this, but cannot within the current time-frame.
Database back-end is MS SQL.
The CSV file has about 1700 rows worth of data in it. I can't see this number exceeding 5000, as there are already many duplicate entries, apparently.
Right now, I am only planning on actually importing 10 of those columns from the CSV, the rest of the table's fields will be left null, and I will be removing the unneeded columns at a later date.
The CSV file is being read into a DataTable to make it easier to work with.
I initially thought that the ContactID field in my Sales Force CSV was a unique identifier, although after doing some test imports, it seems that there are zero unique fields in the CSV file itself, at least that I can find.
Given that, I have been forced to add a primary key field to the Contacts table so that other tables can still maintain a valid relationship with any given contact. However, this obviously prevents me from simply deleting and re-creating the records on each import.

BEGIN EDIT

Its clear to me that what I was trying to achieve, perform updates on existing database records when no relationship exists between the table and the CSV, simply cannot be achieved.

It wasn't so much that I didn't know this beforehand, but more that I was hoping there was just some bright idea I hadn't thought of that could do this.

With that in mind, I ended up deciding just to make the assumption in my algorithm that ContactID is a unique identifier, and then see how many duplicates I ended up with.

I'm going a possible solution as an answer below. Both the algorithm and an actual implementation. I'll leave it for a few more days because I'd much prefer to accept someone else's better solution as the answer.

Here's some things I found after implementing my below solution:

I had to narrow the rows provided by the CSV so that it matched those rows being imported into the database.
The SqlDataReader is perfectly fine, what has the biggest impact is the individual UPDATE/INSERT queries that are performed.
For a completely fresh import, the initial read of items into memory is not noticed by the UI, the insert process takes about 30 seconds to complete.
There were only 15 duplicate IDs skipped on a fresh import, which is less than 1% of the total data set. I have deemed this to be an acceptable loss, as I am told the Sales Force database is going to have a clean-up anyway. I am hoping the IDs can be regenerated in these cases.
I have not collected any resource metrics during the import, but in terms of speed this is OK, because of the progress bar I've implemented to provide feedback to the user.

END EDIT

Resources

Given the allocation size of each field, even with this relatively small number of records, I am concerned mostly about the amount of memory that might be allocated during the import.

The application will not be run in a shared environment, so there is room to breathe in that respect. Also, this particular function would only be run once a week or so, manually.

My aim is to at least be able to run comfortably on a semi-dedicated machine. Machine specs are variable as the application may eventually be sold as a product (though again, not targeted to a shared environment).

In terms of run-time for the import process it-self, as mentioned, this is going to be asynchronous and I have already put together some AJAX calls and a progress bar. So I would imagine that anywhere up to a minute or two would be OK.

Solution

I did find the following post which seems to be close to what I want:

Compare two DataTables to determine rows in one but not the other

It seems to me that performing lookups against a hashtable is the right idea. However, as mentioned, if I can avoid loading both the CSV and the Contacts table into memory entirely, that would be preferred, and I can't see avoiding it with the hashtable method.

One thing I am not sure how to achieve is how I might calculate a hash of each row to compare when one set of data is a DataTable object and the other is an EntitySet of Contact items.

I am thinking that unless I want to manually iterate over each column value in order to calculate the hash, I will need to have both data sets be the same object type, unless anyone has some fancy solutions.

Am I best to simply forget the Entity Framework for this procedure? I've certainly spent a lot of time trying to event remotely perform operations in bulk, so I'm more than happy to remove it from the equation.

If anything doesn't make sense or is missing, I apologise, I'm very tired. Just let me know and I'll fix it tomorrow.

I appreciate any help that can be offered, as I'm beginning to get desperate. I've spent more time agonising how to approach this than I had planned.

Thanks!

every table in salesforce has a primary key column called Id. so you should be able to have a unique key for each row in your CSV. Also, depending exactly on how you're generating the CSV, it may have Ids that are 15 chars in length, these are case sensitive, e.g. 00a & 00A refer to different records. — superfell, Feb 15 '11 at 16:40
@superfell The case sensitive fields you refer to are in fact used in the ContactID and AccountID fields, but even when I modify the coallation of the respective DB columns, I run into issues with non-unique ContactIDs. A grep of the CSV confirms this. I will have to look into including the ID column you speak of into the CSV, though. — Geekman, Feb 15 '11 at 23:13
@superfell That is unknown to me at this point. Right now, I'm simply working off an example that was sent to me in order to get the function working. — Geekman, Feb 16 '11 at 00:17
If you are using SQL2005+, you can use a temp table and the MERGE command. Greatly simplifies the code. — Keith, Jan 30 '13 at 16:12

Joel Mansford · Accepted Answer · 2011-02-18T12:45:39.640

Depending on your timescales I would (and do) simply use DBAmp by Forceamp. This presents as an OLE DB driver and thus is used as a linked server within SQL Server.

The standard use of the tool is to use the supplied stored procedures to replicate / refresh your Salesforce schema to SQL Server. I do this in some very large environments and it's efficient enough to refresh every 15mins without overlapping.

DBAmp maintains the column types in the underlying SQL Server tables.

One last point, beware of 15char Salesforce IDs (SObject IDs). These are only unique when compared case-sensitively. Salesforce reports typically output 15char IDs but API dumps are usually 18char case-insensitive IDs. More info about conversion etc here. If you are still seeing collisions when comparing case-sensitively I would be inclined to think it's some pre-processing being done on the file or possibly an error in the report being used to export.

Further to your comment, Salesforce IDs are globally unique, that is they don't repeat between different customer's production orgs. Thus even if you are pulling in records from multiple orgs they shouldn't collide. Full-copy sandbox orgs do have identical IDs to that of the 'master' production org.

If you are interested in using the API directly checkout the Salesforce.Net library which is pretty good to get you started.

+1 for the info on the API's IDs. I am going to be moving to use the API some time in the coming months, so its good to know that the UD is longer (with less chance of a collision?). I did already realise the ID is case-sensitive, and have changed the coallation of the column to be case-sensitive. Any idea how to avoid duplicates in this case? — Geekman, Feb 18 '11 at 02:29

score 0 · Answer 2 · answered Feb 15 '11 at 22:32

Based on the fact you'd be dealing with no more than 5000 rows at a time, I would be inclined to use ADO.Net (probably a SQLDataReader) only to get the data into objects. WRT primary keys - I don't know about the details of Salesforce-exported data but c.f @superfell's comment. If not, you can generate your own PKs for the objects).

I could then use the methods available to the List<T> class to filter/iterate through the rows by comparing successive fields and so on.

This is mostly motivated by the fact that my C# is many times better than my SQL ;-)

Good luck.

Geekman · Answer 3 · 2011-02-18T07:02:09.990

I have come up with the following possible solution. After implementing it, using my test CSV, I found that there were only 15 duplicate ContactIDs.

Given that this is less than 1% of the current set of contacts, I deemed it acceptable.

For this to work, I've had to make the columns provided by the CSV equal to those imported by the application, otherwise comparison will obviously fail.

Here's the algorithm I've put together:

        /* Algorithm:
         * ----------
         * I'm making the assumption that the ContactID field is going to be unique, and if not, I will ignore any duplicates.
         * The reason for this is that I don't see a way to be able to update an existing database record with the changes in the CSV
         * unless a relationship exists to indicate what CSV record relates to what database record.
         * 
         * - Load DB table into memory
         * - Load CSV records into memory
         * - For each record in the CSV:
         *      - Add this record's contact ID to a list of IDs which need to remain the DB. 
         *        If it already exists in the list, we have a duplicate ID. Skip.
         *        
         *      - Concatenate CSV column values into a single string, store for later comparison.
         *      
         *      - Select the top record from the DB DataTable where: the ContactID field in the DB record matches that in the CSV.
         *      
         *      - If no DB records were found
         *          - Add this new record to the DB.
         *          
         *      - Concatenate column values for the DB record and compare this to the string generated previously.
         *      - If the strings match, skip any further processing
         *       
         *      - For each column in the CSV record:
         *          - Compare against the value for the same column in the DB record.
         *          - If values do not match, use StringBuilder to add to your UPDATE query for this record.
         *          
         * 
         * - Now we need to clean out the records from the DB which no longer exist in the CSV. Use the previously built list of ContactIDs.
         * - For each record in the DB:
         *      - If the ContactID in the DB record is not in your list, use a StringBuilder to add this ID to a DELETE statement. eg. OR [ContactID] = ...
         *      
         */

And here's my implementation:

public class ContactImportService : ServiceBase
{

    private DataTable csvData;

    //...   

    public void DifferentialImport(Guid ID)
    {

        //This is a list of ContactIDs which we come across in the CSV during processing.
        //Any records in the DB which have an ID not in this list will be deleted.
        List<string> currentIDs = new List<string>();

        lock (syncRoot)
        {
            jobQueue[ID].TotalItems = (short)csvData.Rows.Count;
            jobQueue[ID].Status = "Loading contact records";
        }

        //Load existing data into memory from Database.
        SqlConnection connection = 
            new SqlConnection(Utilities.ConnectionStrings["MyDataBase"].ConnectionString);
        SqlCommand command = new SqlCommand("SELECT " +
                "[ContactID],[FirstName],[LastName],[Title]" +
                // Etc...
                "FROM [Contact]" +
                "ORDER BY [ContactID]", connection);

        connection.Open();
        SqlDataReader reader = command.ExecuteReader(CommandBehavior.CloseConnection);
        DataTable dbData = new DataTable();
        dbData.Load(reader);
        reader = null;

        lock (syncRoot)
        {
            jobQueue[ID].Status = "Merging records";
        }

        int affected = -1;
        foreach (DataRow row in csvData.Rows)
        {
            string contactID = row["ContactID"].ToString();
            //Have we already processed a record with this ID? If so, skip.
            if (currentIDs.IndexOf(contactID) != -1)
                break;

            currentIDs.Add(row["ContactID"].ToString());

            string csvValues = Utilities.GetDataRowString(row);

            //Get a row from our DB DataTable with the same ID that we got previously:
            DataRow dbRecord = (from record in dbData.AsEnumerable()
                            where record.Field<string>("ContactID") == contactID
                            select record).SingleOrDefault();

            //Found an ID not in the database yet... add it.
            if (dbRecord == null)
            {
                command = new SqlCommand("INSERT INTO [Contact] " +
                    "... VALUES ...", connection);
                connection.Open();
                affected = command.ExecuteNonQuery();
                connection.Close();
                if (affected < 1)
                {
                    lock (syncRoot)
                    {
                        jobQueue[ID].FailedChanges++;
                    }
                }
            }

            //Compare the DB record with the CSV record:
            string dbValues = Utilities.GetDataRowString(dbRecord);

            //Values are different, we need to update the DB to match.
            if (csvValues == dbValues)
                continue;

            //TODO: Dynamically build the update query based on the specific columns which don't match using StringBulder.
            command = new SqlCommand("UPDATE [Contact] SET ... WHERE [Contact].[ContactID] = @ContactID");
            //...
            command.Parameters.Add("@ContactID", SqlDbType.VarChar, 100, contactID);
            connection.Open();
            affected = command.ExecuteNonQuery();
            connection.Close();

            //Update job counters.
            lock (syncRoot)
            {
                if (affected < 1)
                    jobQueue[ID].FailedChanges++;
                else
                    jobQueue[ID].UpdatedItems++;
                jobQueue[ID].ProcessedItems++;
                jobQueue[ID].Status = "Deleting old records";
            }

        } // CSV Rows

        //Now that we know all of the Contacts which exist in the CSV currently, use the list of IDs to build a DELETE query
        //which removes old entries from the database.
        StringBuilder deleteQuery = new StringBuilder("DELETE FROM [Contact] WHERE ");

        //Find all the ContactIDs which are listed in our DB DataTable, but not found in our list of current IDs.
        List<string> dbIDs = (from record in dbData.AsEnumerable()
                              where currentIDs.IndexOf(record.Field<string>("ContactID")) == -1
                              select record.Field<string>("ContactID")).ToList();

        if (dbIDs.Count != 0)
        {
            command = new SqlCommand();
            command.Connection = connection;
            for (int i = 0; i < dbIDs.Count; i++)
            {
                deleteQuery.Append(i != 0 ? " OR " : "");
                deleteQuery.Append("[Contact].[ContactID] = @" + i.ToString());
                command.Parameters.Add("@" + i.ToString(), SqlDbType.VarChar, 100, dbIDs[i]);
            }
            command.CommandText = deleteQuery.ToString();

            connection.Open();
            affected = command.ExecuteNonQuery();
            connection.Close();
        }

        lock (syncRoot)
        {
            jobQueue[ID].Status = "Finished";
        }

        remove(ID);

    }

}

The SqlDataReader seems sufficient, it is the individual UPDATE queries which take the bulk of the time, all other operations are negligible in comparison.

I would say at this point it takes about 30 seconds to do a new import, where all records must be imported. With the progress feedback I have implemented, this is fast enough for the end-user.

I haven't measured any resource use yet.

Advice on ways to merge CSV records into an SQL Table in ASP .NET

3 Answers3