
We had performance issues in our app due to one-by-one updates of entities in a DB table with a high row count (more than a million). We kept getting deadlock victim errors, so it was obvious that the table was holding row locks longer than it should have.

We have now implemented manual batching of the stored procedure calls that update entities in the DB, flushed when a configurable size limit or time threshold is reached.

The simplified class would look something like this:

using System;
using System.Collections.Generic;
using System.Timers;
using System.Xml.Serialization;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public class EntityBatchUpdater
{
    private readonly IRepository _repository;
    private readonly ILogger<EntityBatchUpdater> _logger;
    private List<Entity> _batch = new List<Entity>();

    private readonly Timer _batchPostingTimer;
    private readonly int _batchSize;

    private static readonly object _batchPostingLock = new object();

    public EntityBatchUpdater(IRepository repository, ILogger<EntityBatchUpdater> logger)
    {
        _repository = repository;
        _logger = logger;

        _batchSize = 1000; // configurable

        string batchPostingPeriod = "00:01:00.00"; // configurable
        _batchPostingTimer = new Timer
        {
            Interval = TimeSpan.Parse(batchPostingPeriod).TotalMilliseconds,
            Enabled = true,
            AutoReset = true,
        };
        _batchPostingTimer.Elapsed += OnTimedEvent;
    }

    public void Update(Entity entity)
    {
        try
        {
            lock (_batchPostingLock)
            {
                _batch.Add(entity);

                if (_batch.Count == _batchSize)
                {
                    EntityBatchUpdate();
                }
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, $"Failed to insert batch {JsonConvert.SerializeObject(batch)}");
        }
    }

    private void EntityBatchUpdate()
    {
        if (_batch.Count == 0)
        {
            return;
        }              
        try
        {
            var entityBatchXML = SerializePayload();
            _repository.BatchUpdate(entityBatchXML);
            _batch.Clear();
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, $"Failed to insert batch; batchSize:{_batch.Count}");
        }
    }

    private string SerializePayload()
    {
        using (var sw = new System.IO.StringWriter())
        {
            XmlSerializerNamespaces ns = new XmlSerializerNamespaces();
            ns.Add("", "");
            var serializer = new XmlSerializer(typeof(List<Entity>),
                                               new XmlRootAttribute("ENTITY_BATCH"));
            serializer.Serialize(sw, _batch, ns);
            return sw.ToString();
        }
    }

    private void OnTimedEvent(object source, ElapsedEventArgs e)
    {
        // Flush on the timer under the same lock as Update, so the timer
        // cannot race a size-triggered flush.
        lock (_batchPostingLock)
        {
            EntityBatchUpdate();
        }
    }
}

We currently take advantage of SQL Server's fast XML processing: the batch is serialized into XML and sent to the stored procedure in a single call, so we avoid hitting the DB with a lot of round trips. I also thought about using a Table Valued Parameter to ship the data to the proc, but I don't think that would drastically improve performance.
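
For reference, a minimal sketch of what the Table Valued Parameter variant could look like on the C# side, assuming a user-defined table type dbo.EntityTableType and a procedure dbo.EntityBatchUpdate exist on the server (both names, and the Id/Status columns, are hypothetical):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class TvpBatchUpdater
{
    public static void BatchUpdate(string connectionString, IEnumerable<Entity> batch)
    {
        // Shape the batch as a DataTable whose columns match dbo.EntityTableType.
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Status", typeof(string));
        foreach (var entity in batch)
        {
            table.Rows.Add(entity.Id, entity.Status);
        }

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.EntityBatchUpdate", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            var parameter = command.Parameters.AddWithValue("@Entities", table);
            parameter.SqlDbType = SqlDbType.Structured;  // mark the parameter as a TVP
            parameter.TypeName = "dbo.EntityTableType";  // the table type defined on the server

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}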

My question is: how do you handle a load like this on your DB? 1.) Did you use a NuGet package or some other tool to handle the batching for you? 2.) Did you solve this using some other practice?

Edit: To give a bit more insight: we are currently processing a queue manually in our app (hence the huge number of updates), and we want to do it as fast and as reliably as possible. We will move to a better queueing mechanism in the future (RabbitMQ or Kafka), but in the meantime we want a standard approach for consuming and processing queues from a DB table.
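
The kind of table-as-queue consumption we have in mind looks roughly like the sketch below (the dbo.WorkQueue table and its columns are hypothetical); the READPAST hint lets concurrent consumers skip rows that are already locked instead of blocking on them:

using System.Data;
using System.Data.SqlClient;

public static class DbQueueConsumer
{
    // Atomically remove up to @BatchSize rows and return them to the caller.
    private const string DequeueSql = @"
        DELETE TOP (@BatchSize)
        FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
        OUTPUT deleted.Id, deleted.Payload;";

    public static DataTable DequeueBatch(string connectionString, int batchSize)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(DequeueSql, connection))
        {
            command.Parameters.Add("@BatchSize", SqlDbType.Int).Value = batchSize;

            var result = new DataTable();
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                result.Load(reader);
            }
            return result;
        }
    }
}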

Peter Sandor
    Why not use SQLBulkCopy like in this thread? [SQLBulkCopy](https://stackoverflow.com/questions/13722014/insert-2-million-rows-into-sql-server-quickly) – Tedy Levy Dec 21 '19 at 16:26
  • @TedyLevy thanks, good idea! Haven't tried that, but will give it a go. Currently our approach works just fine with a couple of thousand rows and I wouldn't change the codebase for a hunch. I was mostly wondering what other people used in real-life apps to solve this issue. – Peter Sandor Dec 21 '19 at 16:38
  • @TedyLevy I guess my question wasn't clear enough. From a performance point of view, the above solution is good enough. I'm mostly interested from a code quality/readability/reusability point of view if there is a better approach to solve this issue. – Peter Sandor Dec 21 '19 at 16:40
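
A minimal sketch of the SqlBulkCopy approach suggested in the comment above, assuming the batch is first staged into a hypothetical dbo.EntityStaging table that a stored procedure or MERGE then applies to the real table:

using System.Data;
using System.Data.SqlClient;

public static class BulkCopyLoader
{
    public static void Load(string connectionString, DataTable batch)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.EntityStaging";
            bulkCopy.BatchSize = 1000;       // rows sent per round trip
            bulkCopy.BulkCopyTimeout = 60;   // seconds
            bulkCopy.WriteToServer(batch);   // batch columns must match the staging table
        }
    }
}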

1 Answer


The most confusing part is that you are doing this work from C# code in the first place. You are working on more than 1,000,000 records. That is not work you should be doing in application code, ever. I do not really see what you are doing with those records, but as a general rule, keep work of that scale in the database. Always try to do as much filtering, inserting, bulk inserting, bulk updating and the like on the DB side.

Never move what could be SQL into a client program. At best you add the network load of having to move the data over the wire once or even twice (for updates). At worst you add a huge risk of race conditions. And at no point will you have a chance to beat the time the DB server needs for the same operation.

This seems to be just a bulk insert. SQL has a separate command for it, and any DBMS worth its disk space can import data from a variety of file formats, including .csv.
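
As a rough illustration of that point, the import can be kicked off with a single T-SQL batch and the server ingests the file itself (the table name, file path and CSV layout here are hypothetical, and the file must be readable by the SQL Server service account):

using System.Data.SqlClient;

public static class ServerSideImport
{
    public static void ImportCsv(string connectionString)
    {
        // The server reads the file directly; the client only issues the command.
        const string sql = @"
            BULK INSERT dbo.Entity
            FROM 'C:\imports\entities.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}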

Christopher
  • I totally agree on keeping data related code as close to the DB as possible, that's why we have a stored procedure handling bulk insert/update operations. I was wondering if there was a cleaner way than TVPs or XML serialized batch data to do that (does Entity Framework or even Dapper help in any way with this?) – Peter Sandor Dec 21 '19 at 16:52
  • @PeterSandor Keeping it on the DB does not end with "Stored Procedure". It ends at "the DBMS will read in and process the file itself, no C# code involved". – Christopher Dec 21 '19 at 17:18