
I'm reading a CSV file a few times a day. It's about 300MB, and each time I have to read it through, compare it with the existing data in the database, add new rows, hide old ones and update existing ones. There is also a bunch of data that doesn't get touched at all.

I have access to all the files, both old and new, and I'd like to compare the new one with the previous one and only update what's changed in the file. I have no idea what to do, and I'm using C# for all my work. The thing that might be most problematic is that a row in the previous feed might be at a different location in the new feed even if it hasn't been updated at all. I want to avoid that problem as well, if possible.

Any idea would help.

erincerol
  • Is there a unique ID on the existing rows? – adam0101 Sep 15 '14 at 19:06
  • Are you familiar with how to create a `Delta`? Also, how are you storing the data that you want to compare? Can you not dump the data into a DataTable and do your comparison that way, or create a temp table on the server and a small routine that compares against the database for any adds, updates or deletes? – MethodMan Sep 15 '14 at 19:07
  • Is there a modified date field? i.e. is there an easy way to tell when a row was modified or created? – Kyle Sep 15 '14 at 19:08
  • @adam0101, Yes, data has several unique fields. – erincerol Sep 15 '14 at 19:08
  • @DJKRAZE, no, I'm not familiar with Delta. The data is converted into an object with some modifications, so I don't think comparing against the DB is an option for me. I'd like to avoid reading and parsing the whole file and inserting/deleting/updating it in the DB since it takes too long. – erincerol Sep 15 '14 at 19:11
  • @KKKKKKKK, sadly not. – erincerol Sep 15 '14 at 19:12
  • Then you can do this at the class level, and I am sure you can use the IComparable approach as well. This will take some work on your part; it's not a quick fix, as they say. – MethodMan Sep 15 '14 at 19:13
  • @DJKRAZE I know it might take some work, but I need some guidance on how to do it since it's not something I'm familiar with. – erincerol Sep 15 '14 at 19:14
  • If there's no Modified_Date field, there's no way to tell whether a field was edited for a particular row. You probably just have to compare one row at a time. – Kyle Sep 15 '14 at 21:22

3 Answers

  • Use one of the existing CSV parsers
  • Parse each row into a mapped class object
  • Override Equals and GetHashCode for your object
  • Keep a List<T> or HashSet<T> in memory; initialize it empty at the first step.
  • On reading each line from the CSV file, check whether the object exists in your in-memory collection (List, HashSet)
  • If the object doesn't exist in your in-memory collection, add it to the collection and insert it in the database.
  • If the object exists in your in-memory collection, ignore it (the check is based on your Equals and GetHashCode implementations, and then it is as simple as if(inMemoryCollection.Contains(currentRowObject)))
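The steps above could be sketched roughly as follows. `CsvRow` and its fields are made up for illustration, not the OP's actual schema, and `HashCode.Combine` assumes .NET Core 2.1 or later (on older frameworks you would combine hashes by hand):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type; swap in your real column names and types.
public class CsvRow
{
    public string Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }

    // Two rows are "the same record" when every compared field matches.
    public override bool Equals(object obj) =>
        obj is CsvRow other &&
        Id == other.Id && Name == other.Name && Price == other.Price;

    public override int GetHashCode() => HashCode.Combine(Id, Name, Price);
}

public static class Demo
{
    public static void Main()
    {
        var seen = new HashSet<CsvRow>();

        var first  = new CsvRow { Id = "1", Name = "Widget", Price = 9.99m };
        var repeat = new CsvRow { Id = "1", Name = "Widget", Price = 9.99m };

        Console.WriteLine(seen.Add(first));  // True  -> new row, insert into DB
        Console.WriteLine(seen.Add(repeat)); // False -> already seen, ignore
    }
}
```

Because membership is decided by Equals/GetHashCode rather than by position, the same record is ignored no matter where it appears in the next file.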

I guess you have a Windows service reading CSV files periodically from a file location. You can repeat the above process every time you read a new CSV file. This way you will be able to maintain an in-memory collection of the previously inserted objects and ignore them, irrespective of their position in the CSV file.

If you have a primary key defined for your data, then you can use a Dictionary<TKey, TValue>, where your key is the unique field. This gives you faster comparisons, and you can skip the Equals and GetHashCode implementations.
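With a unique key, that lookup might look like the sketch below. The string-keyed dictionary of raw lines is an assumption for illustration; in practice you would key on whatever your unique column is:

```csharp
using System;
using System.Collections.Generic;

public static class KeyedDiff
{
    // previous: ID -> raw line from the last run.
    // current:  ID/line pairs from the new file.
    // Splits the new rows into inserts and updates; untouched rows are skipped.
    public static (List<string> Added, List<string> Changed) Diff(
        Dictionary<string, string> previous,
        IEnumerable<KeyValuePair<string, string>> current)
    {
        var added = new List<string>();
        var changed = new List<string>();

        foreach (var kv in current)
        {
            if (!previous.TryGetValue(kv.Key, out string oldLine))
                added.Add(kv.Value);        // unseen ID -> INSERT
            else if (oldLine != kv.Value)
                changed.Add(kv.Value);      // known ID, different content -> UPDATE
            // identical content -> ignore, regardless of its position in the file
        }
        return (added, changed);
    }
}
```

Keying on the ID is also what makes row order irrelevant: a row that moved but didn't change maps to the same dictionary entry and compares equal.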

As a backup to this process, your DB writing routine/stored procedure should be defined so that it first checks whether the record already exists in the table; in that case it updates the row, otherwise it inserts a new record. This is an UPSERT.
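For SQL Server, such an upsert statement might look like the fragment below. The table and column names (`dbo.Items`, `Id`, `Name`) are placeholders, not the OP's schema; on SQL Server 2008+ a `MERGE` statement is another option:

```csharp
public static class Upsert
{
    // Placeholder schema: dbo.Items(Id BIGINT PRIMARY KEY, Name VARCHAR(100)).
    // Classic IF EXISTS-style upsert, run via a parameterized SqlCommand.
    public const string Sql = @"
        IF EXISTS (SELECT 1 FROM dbo.Items WHERE Id = @id)
            UPDATE dbo.Items SET Name = @name WHERE Id = @id;
        ELSE
            INSERT INTO dbo.Items (Id, Name) VALUES (@id, @name);";
}
```

Wrapping the check and the write in one statement (or one stored procedure) keeps the decision on the server, so the C# side only ever sends rows.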

Remember, if you end up maintaining an in-memory collection, keep clearing it periodically; otherwise you could end up with an OutOfMemoryException.

Habib

How big is the CSV file? If it's small, try the following:

    string[] file1Lines = File.ReadAllLines(pathOfFileA);
    string[] file2Lines = File.ReadAllLines(pathOfFileB);
    List<string> newLines = new List<string>();
    // Guard against the two files having different lengths.
    int lineCount = Math.Min(file1Lines.Length, file2Lines.Length);
    for (int lineNo = 0; lineNo < lineCount; lineNo++)
    {
        if (!String.IsNullOrEmpty(file1Lines[lineNo]) && !String.IsNullOrEmpty(file2Lines[lineNo]))
        {
            // Both files have content here: keep the new version if it differs.
            if (String.Compare(file1Lines[lineNo], file2Lines[lineNo]) != 0)
                newLines.Add(file2Lines[lineNo]);
        }
        else if (!String.IsNullOrEmpty(file2Lines[lineNo]))
        {
            // Only the new file has content here: it's a new line.
            newLines.Add(file2Lines[lineNo]);
        }
    }
    // Any extra lines at the end of the new file are new as well.
    for (int lineNo = lineCount; lineNo < file2Lines.Length; lineNo++)
        newLines.Add(file2Lines[lineNo]);

    if (newLines.Count > 0)
    {
        File.WriteAllLines(newFilePath, newLines);
    }
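Note that a line-by-line comparison like this assumes both files keep rows in the same order, which the question says is not guaranteed. A set-based diff over whole lines ignores position entirely (a sketch; it assumes exact duplicate lines within one file don't matter):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineDiff
{
    public static void Main()
    {
        // In real use these come from File.ReadAllLines on the old/new paths.
        string[] oldLines = { "1,apple,1.00", "2,pear,2.00", "3,plum,3.00" };
        string[] newLines = { "3,plum,3.00", "1,apple,1.50", "4,kiwi,4.00" };

        var oldSet = new HashSet<string>(oldLines);
        var newSet = new HashSet<string>(newLines);

        // Only in the new file: rows that were added or modified.
        var addedOrChanged = newLines.Where(l => !oldSet.Contains(l)).ToList();
        // Only in the old file: rows that were removed or modified.
        var removedOrChanged = oldLines.Where(l => !newSet.Contains(l)).ToList();

        Console.WriteLine(string.Join(" | ", addedOrChanged));   // 1,apple,1.50 | 4,kiwi,4.00
        Console.WriteLine(string.Join(" | ", removedOrChanged)); // 1,apple,1.00 | 2,pear,2.00
    }
}
```

A row whose unique ID shows up in both result lists was modified; intersecting the two lists on that ID separates updates from pure inserts and deletes.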

Just curious, why do you have to compare the old file with the new file? Isn't the data from the old file in SQL Server already? (When you say database, you mean SQL Server, right? I'm assuming SQL Server because you use C#/.NET.)

My approach is simple:

  • Load new CSV file into a staging table
  • Use stored procs to insert, update, and set inactive rows

    public static void ProcessCSV(FileInfo file)
    {
        foreach (string line in ReturnLines(file))
        {
            //break the line up and parse the values into parameters
            //(assumes a simple comma-separated layout with no quoted fields)
            string[] values = line.Split(',');

            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand command = conn.CreateCommand())
            {
                command.CommandType = CommandType.StoredProcedure;
                command.CommandText = "[dbo].sp_InsertToStaging";

                //values parsed from the current line; indexes depend on your column order
                command.Parameters.Add("@id", SqlDbType.BigInt).Value = long.Parse(values[0]);
                command.Parameters.Add("@SomethingElse", SqlDbType.VarChar).Value = values[1];
    
                //execute
                if (conn.State != ConnectionState.Open)
                    conn.Open();
    
                try
                {
                    command.ExecuteNonQuery();
                }
                catch (SqlException exc)
                {
                    //throw or do something
                }
            }
        }
    }
    
    public static IEnumerable<string> ReturnLines(FileInfo file)
    {
        using (FileStream stream = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
    

Now you write stored procs to insert, update, and set rows inactive based on IDs. You'll know a row is updated if Field_x(main_table) != Field_x(staging_table) for a particular Id, and so on.

Here's how you detect changes and updates between your main table and staging table.

/* SECTION: SET INACTIVE */
UPDATE main_table
SET IsActiveTag = 0
WHERE unique_identifier IN
    (
        SELECT a.unique_identifier
        FROM main_table AS a INNER JOIN staging_table AS b
        --inner join because you only want existing records
        ON a.unique_identifier = b.unique_identifier
        --detect any updates
        WHERE a.field1 <> b.field1
            OR a.field2 <> b.field2
            OR a.field3 <> b.field3
            --etc
    )

/* SECTION: INSERT UPDATED AND NEW */
INSERT INTO main_table
SELECT *
FROM staging_table AS b
LEFT JOIN
    (SELECT *
    FROM main_table
    --only get active records
    WHERE IsActiveTag = 1) AS a
ON b.unique_identifier = a.unique_identifier
--select only records available in staging table
WHERE a.unique_identifier IS NULL
Kyle
  • The reason is that loading data into the DB takes too long. Currently, I'm comparing the new file data with the DB, and I do inserts/updates and sometimes deletes. If I'm able to get the difference between the files somehow, I can work on just that changed data, rather than working with the whole file and DB. – erincerol Sep 17 '14 at 10:37