1

I've been playing with @Cinchoo's fantastic ETL system for C#. I need to compare two CSV files, where one CSV file is defined as a dynamically growing master table and the other is a feeder "detail" table.

The detail table may have differences in terms of NEW records, CHANGED records, or a record no longer (DELETED) existing in the master CSV file.

The output should be a 3rd table that replaces or updates the master table - so it's a growing CSV file.

Both tables have unique ID columns and a header row.

MASTER CSV

ID,name
1,Danny
2,Fred
3,Sam

DETAIL

ID,name
1,Danny
          <-- record no longer exists
3,Pamela <-- name change
4,Fernando   <-- new record

So far I've been referring to this fiddle, and the code below:

using System;
using ChoETL;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var input1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
        var input2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();

        Console.WriteLine("NEW records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input2.OfType<ChoDynamicObject>().Except(input1.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
        }
        
        Console.WriteLine("\n\nDELETED records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
        }
        
        Console.WriteLine("\n\nCHANGED records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id", "name" })));
        }
    }
    
    static string csv1 = @"
ID,name
1,Danny
2,Fred
3,Sam";
    
    static string csv2 = @"
ID,name
1,Danny
3,Pamela
4,Fernando";
}

OUTPUT

NEW records

ID,name
4,Fernando

DELETED records

ID,name
2,Fred

CHANGED records

ID,name
2,Fred
3,Sam

The CHANGED records is not working. As an added extra, I need a status so I want it to look like this:

CHANGED records
    
ID,name,status
1,Danny,NOCHANGE
2,Fred,DELETED
3,Pamela,CHANGED
4,Fernando,NEW

Thanks

Cinchoo
  • 6,088
  • 2
  • 19
  • 34
Fandango68
  • 4,461
  • 4
  • 39
  • 74

1 Answers1

1

Here is how you can do with Cinchoo ETL

            string csv1 = @"ID,name
1,Danny
2,Fred
3,Sam";

            string csv2 = @"ID,name
1,Danny
3,Pamela
4,Fernando";

            var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
            var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();

            using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
            {
                var newItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                    .Select(r => 
                    {
                        var dict = r.AsDictionary();
                        dict["Status"] = "NEW"; 
                        return new ChoDynamicObject(dict); 
                    }).ToArray();

                var deletedItems = r1.OfType<ChoDynamicObject>().Except(r2.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                    .Select(r =>
                    {
                        var dict = r.AsDictionary();
                        dict["Status"] = "DELETED";
                        return new ChoDynamicObject(dict);
                    }).ToArray();

                var changedItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
                    .Except(newItems.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                    .Select(r =>
                    {
                        var dict = r.AsDictionary();
                        dict["Status"] = "CHANGED";
                        return new ChoDynamicObject(dict);
                    }).ToArray();

                var noChangeItems = r1.OfType<ChoDynamicObject>().Intersect(r2.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
                    .Select(r =>
                    {
                        var dict = r.AsDictionary();
                        dict["Status"] = "NOCHANGE";
                        return new ChoDynamicObject(dict);
                    }).ToArray();

                var finalResult = Enumerable.Concat(newItems, deletedItems).Concat(changedItems).Concat(noChangeItems).OfType<dynamic>().OrderBy(r => r.ID);
                w.Write(finalResult);
            }

            Console.WriteLine();

Output:

ID,name,Status
1,Danny,NOCHANGE
2,Fred,DELETED
3,Pamela,CHANGED
4,Fernando,NEW

Sample fiddle: https://dotnetfiddle.net/mrHpFx

UPDATE #1:

Above approach will work for small CSV files. For large CSV files, you must avoid it. Rather approach it in stream manner. Sample fiddle shows how (Not fully tested, but it gives direction to do it.)

Sample fiddle: https://dotnetfiddle.net/mh6w44

UPDATE #2:

Now Cinchoo ETL (v1.2.1.33) comes with built-in API to compare the CSV files in simplified manner

var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();
var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();

using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
    foreach (var t in r1.Compare(r2, "ID", "name" ))
    {
        dynamic v1 = t.MasterRecord as dynamic;
        dynamic v2 = t.DetailRecord as dynamic;
        if (t.Status == CompareStatus.Unchanged || t.Status == CompareStatus.Deleted)
        {
            v1.Status = t.Status.ToString();
            w.Write(v1);
        }
        else 
        {
            v2.Status = t.Status.ToString();
            w.Write(v2);
        }
    }
}

Sample fiddle: https://dotnetfiddle.net/uPR5Sq

Cinchoo
  • 6,088
  • 2
  • 19
  • 34
  • Thank you, and thanks for replying to my other comment. Much appreciate it – Fandango68 Nov 28 '21 at 07:58
  • 1
    added sample fiddle to handle large CSV files. take a look at it. – Cinchoo Nov 28 '21 at 14:28
  • One problem I've encountered. How to compare the "master" CSV which has two extra columns - LastModifiedDate and Status, with the "detail" CSV that doesn't have those two columns? Sorry I am new to LINQ. Using your ETL I have to make both CSVs have the same number of columns? – Fandango68 Nov 29 '21 at 03:19
  • And thanks for the UPDATE. In my case I am talking less than 200 records. – Fandango68 Nov 29 '21 at 03:20
  • 1
    with the latest version v1.2.1.33, you can compare CSV files having additional columns. See `UPDATE #2` above. – Cinchoo Dec 12 '21 at 22:57
  • 1
    Brilliant! Thanks mate. – Fandango68 Dec 14 '21 at 05:24