
I am using Entity Framework / Azure Functions in an attempt to keep a database up to date based on this GitHub repo of csvs.

The Process

  1. Timer function runs and checks if a new commit is available on master
  2. If there's a new commit, upload each csv into Azure Blob Storage
  3. A Queue Trigger fires on each blob upload, reads the CSV using CsvHelper, and uploads it to the database (see code below)

Objective

I want to improve step 3. Instead of completely dropping the table and then recreating everything, I want to do some sort of a diff and then upsert (or delete) from there.

Code:

    [FunctionName("CsvQueueProcessor")]
    public async Task Run([QueueTrigger("csvqueue", Connection = "AzureWebJobsStorage")]string myQueueItem, ILogger log)
    {
        var context = new ChadwickDbContext(Environment.GetEnvironmentVariable("DefaultConnectionString"));
        var csvEntity = JsonConvert.DeserializeObject<CsvProcessorEntity>(myQueueItem);
        if (csvEntity.FileName.ToLower() != "allstarfull.csv") return;
        log.LogInformation($"About to get the file from blob storage: {csvEntity.FileName}");
        var container = CloudBlobClient.GetContainerReference("csvs");
        await container.CreateIfNotExistsAsync();
        var blob = container.GetBlobReference(csvEntity.FileName);
        // async Task (not async void) so the Functions host can await completion
        // and observe exceptions; using blocks dispose the stream and reader.
        using (var stream = await blob.OpenReadAsync())
        using (var csv = new CsvReader(new StreamReader(stream)))
        {
            while (csv.Read())
            {
                var record = csv.GetRecord<AllStarFull>();
                context.AllStarFull.Add(record);
            }
        }

        await context.SaveChangesAsync();
    }

Everything works great! But, how would I go about diffing? Are there any entity framework features that allow some kind of comparison or 'no action' if a record is already present? I ran across this thread, but I don't think it's quite what I want.
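For reference, one way to get "no action / update if already present" with plain Entity Framework is to look each record up by its natural key before deciding to insert or update. This is a sketch, not code from the post: it reuses the context and reader from the question, and it assumes (PlayerId, YearId, GameNumber) uniquely identifies a row, which may not hold for your CSVs.

```csharp
// Sketch: upsert row-by-row instead of drop/recreate.
// Assumes (PlayerId, YearId, GameNumber) is a unique natural key.
while (csv.Read())
{
    var record = csv.GetRecord<AllStarFull>();
    var existing = await context.AllStarFull.FirstOrDefaultAsync(a =>
        a.PlayerId == record.PlayerId &&
        a.YearId == record.YearId &&
        a.GameNumber == record.GameNumber);

    if (existing == null)
    {
        context.AllStarFull.Add(record);          // not present -> insert
    }
    else
    {
        existing.TeamId = record.TeamId;          // present -> update in place
        existing.LeagueId = record.LeagueId;
        existing.PlayedInGame = record.PlayedInGame;
        existing.StartingPosition = record.StartingPosition;
    }
}
await context.SaveChangesAsync();
```

Note this issues one query per CSV row, which is exactly the chattiness the comments below warn about for large files.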

    public class AllStarFull
    {
        [Ignore]
        public int Id { get; set; }
        [Name("playerID")]
        public string PlayerId {get;set;}
        [Name("yearID")]
        public int YearId {get;set;}
        [Name("gameNum")]
        public int? GameNumber {get;set;}
        [Name("teamID")]
        public string TeamId {get;set;}
        [Name("lgID")]
        public string LeagueId {get;set;}
        [Name("GP")]
        public int? PlayedInGame { get; set; }
        [Name("startingPos")]
        public int? StartingPosition { get; set; }
    }
mwilson
    I'd get the database to do the diff, and I wouldn't care to compare individual fields; I'd just use an SQL MERGE statement to either find the row or not based on the primary key. If it's there it is updated. If it's not, it is inserted. All in, I consider it minimally better than your existing drop/create. If drop/create causes no issues I would leave it alone. Maybe you can create/rename/rename/drop to minimise the downtime or use http://msdn.microsoft.com/en-us/library/ms191160.aspx – Caius Jard May 04 '19 at 05:28
    ps the reason I say leave it alone is because I doubt it would be faster to drag all of a table's data out over the network, onto the client, pick through it, and send data back bit by bit, have the db look it up, update it etc. If there are a million records and half are updated, that's 1.5 million network transits plus tens of millions of compares of fields, versus just dump and transit 1 million records. Ensure your bulk load is as efficient as possible? – Caius Jard May 04 '19 at 05:31
  • Yea, the drop/create seems the easiest. Some of these CSVs are quite large, 20,000+ records. I'm quite impressed by how fast the code runs to create the table with that amount of data. So, given that, if I don't gain a lot by doing the diff logic, drop/create sounds like the best way to do it. I want to read more into the article you provided first, because partition switching or having some kind of A/B setup would certainly minimize downtime. – mwilson May 05 '19 at 00:26
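The MERGE approach from the comments could look roughly like this: bulk-load the CSV into a staging table first, then let the database do the diff in one statement. This is a sketch only; `AllStarFullStaging` is a hypothetical staging table, and the column/key names follow the `AllStarFull` model above.

```csharp
// Sketch of the SQL MERGE idea: the database compares staging vs. target,
// updating matches, inserting new rows, and deleting rows absent from the CSV.
var mergeSql = @"
MERGE AllStarFull AS target
USING AllStarFullStaging AS source
    ON target.PlayerId = source.PlayerId
   AND target.YearId = source.YearId
   AND ISNULL(target.GameNumber, -1) = ISNULL(source.GameNumber, -1)
WHEN MATCHED THEN
    UPDATE SET target.TeamId = source.TeamId,
               target.LeagueId = source.LeagueId,
               target.PlayedInGame = source.PlayedInGame,
               target.StartingPosition = source.StartingPosition
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PlayerId, YearId, GameNumber, TeamId, LeagueId, PlayedInGame, StartingPosition)
    VALUES (source.PlayerId, source.YearId, source.GameNumber, source.TeamId,
            source.LeagueId, source.PlayedInGame, source.StartingPosition)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;";

// EF6 spelling; in EF Core this would be context.Database.ExecuteSqlRawAsync.
await context.Database.ExecuteSqlCommandAsync(mergeSql);
```

This keeps the comparison entirely server-side, avoiding the per-row round trips of a client-side diff.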

0 Answers