
I have a document with 3 million lines that I have to process and then insert into a table in a database. The database connector class has Insert and Update methods that receive the data and then perform the operation. The Insert and Update work perfectly.

The problem is that I want to split these modifications to the database to avoid overloading the DB. Ideally I would like to automatically split arrFinal into chunks of n lines and then handle each part separately, but I don't know how to do this.

This is the foreach I have right now:

ConnectDB cdb = new ConnectDB();

foreach (string s in arrFinal)
{
    if (s.Split(';')[1] == "REQUEST")
    {
        cdb.Insert(s.Split(';')[0], s.Split(';')[2], s.Split(';')[3], s.Split(';')[4], s.Split(';')[5], dateLog);
    }
    else if (s.Split(';')[1] == "RESPONSE")
    {
        cdb.Update(s.Split(';')[0], s.Split(';')[5]);
    }
}

In case you wonder how data comes:

00:00:00.7443;REQUEST;POST;https://ulrName/Prices/;/1_0/962;https://ulrName/Prices/1_0/962
00:00:00.7793;RESPONSE;POST;https://ulrName/Prices/;/1_0/962;https://ulrName/Prices/1_0/962

Thank you in advance for your help. I'm open to try any approach to get this right.

An3SC
    Hello. Assuming arrFinal is your final array with 3 million lines: if possible, it might be better not to collect that many records/lines in RAM. Instead you could divide the reading of the document into smaller jobs with parameters like startingLineIndex and lineCountLimit. Then create a function that reads a predetermined number of lines, applies the inserts/updates, reads the next batch, and so on. That might be better practice than reading all the lines and then trying to process them. – AntiqTech Dec 07 '21 at 12:07
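A minimal sketch of what that comment suggests, assuming the log lives in a file (the path `"log.txt"` and batch size of 500 are assumptions, and ConnectDB/dateLog come from the question). File.ReadLines streams the file lazily, so the 3 million lines are never all held in RAM at once; Enumerable.Chunk (available in .NET 6+) groups them into fixed-size batches:

```csharp
using System;
using System.IO;
using System.Linq;

const int batchSize = 500; // assumption: tune to your DB's tolerance

ConnectDB cdb = new ConnectDB();

// File.ReadLines enumerates lazily, unlike File.ReadAllLines,
// so memory usage stays bounded by one batch at a time.
foreach (var chunk in File.ReadLines("log.txt").Chunk(batchSize))
{
    foreach (var s in chunk)
    {
        var data = s.Split(';');
        if (data[1] == "REQUEST")
            cdb.Insert(data[0], data[2], data[3], data[4], data[5], dateLog);
        else if (data[1] == "RESPONSE")
            cdb.Update(data[0], data[5]);
    }
    // could pause, commit, or checkpoint progress between chunks here
}
```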

2 Answers


Well, for starters you really do not need all the splits - why recalculate something so many times?

ConnectDB cdb = new ConnectDB();

foreach (string s in arrFinal)
{
    var data = s.Split(';');
    if (data[1] == "REQUEST")
    {
        cdb.Insert(data[0], data[2], data[3], data[4], data[5], dateLog);
    }
    else if (data[1] == "RESPONSE")
    {
        cdb.Update(data[0], data[5]);
    }
}

You could also use something like spans.
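For illustration, a span-based version might look like the sketch below. GetField is a hypothetical helper (not part of any library) that slices out the n-th ';'-separated field without allocating a string[] per line; the six-field format is taken from the sample data in the question:

```csharp
using System;

// Hypothetical helper: returns the field at `index` in a ';'-delimited line
// as a slice of the original string, with no intermediate allocations.
static ReadOnlySpan<char> GetField(ReadOnlySpan<char> line, int index)
{
    for (int i = 0; i < index; i++)
        line = line[(line.IndexOf(';') + 1)..]; // skip past each separator
    int next = line.IndexOf(';');
    return next < 0 ? line : line[..next];
}

// Usage inside the loop from the question:
// var kind = GetField(s.AsSpan(), 1);          // "REQUEST" or "RESPONSE"
// if (kind.SequenceEqual("REQUEST")) { /* Insert */ }
```

Whether this is worth it depends on profiling; for 3 million lines the database round-trips will likely dominate, not the string splitting.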

As for splitting the array:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
{
    using var enumerator = source.GetEnumerator();
    while (enumerator.MoveNext()) 
    {
        yield return YieldBatchElements(enumerator, batchSize - 1);
    }
}

private static IEnumerable<T> YieldBatchElements<T>(IEnumerator<T> source, int batchSize)
{
    yield return source.Current;
    for (var i = 0; i < batchSize && source.MoveNext(); i++)
    {
        yield return source.Current;
    }
}

To get all the results batched at once, just append .ToList() at the end when calling the Batch() extension method.
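Putting it together with the loop from the question, usage might look like this (the batch size of 500 is an assumption; cdb, dateLog, and arrFinal are the question's own variables):

```csharp
// Consume each batch eagerly before moving to the next one -
// the Batch extension above shares a single underlying enumerator.
foreach (var batch in arrFinal.Batch(500))
{
    foreach (string s in batch)
    {
        var data = s.Split(';');
        if (data[1] == "REQUEST")
            cdb.Insert(data[0], data[2], data[3], data[4], data[5], dateLog);
        else if (data[1] == "RESPONSE")
            cdb.Update(data[0], data[5]);
    }
    // could sleep or commit a transaction between batches here
}
```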

Another option would be to load part of the file and process it, then another part, and so on, like AntiqTech mentioned, but that would require some form of saving how much has been processed so far in case of app failure.

quain
    Thank you for your answer, quain!! I've taken a different approach, I'm publishing what worked for me in this case. I also want to thank you for your advice on the many splits I did. – An3SC Dec 07 '21 at 12:57

I've finally managed to accomplish what I wanted thanks to this answer in another post:

How to split an array into chunks of specific size?

My final code goes like this:

String[][] chunks = arrFinal
                    .Select((s, i) => new { Value = s, Index = i })
                    .GroupBy(x => x.Index / 500)
                    .Select(grp => grp.Select(x => x.Value).ToArray())
                    .ToArray();

ConnectDB cdb = new ConnectDB();

for (int i = 0; i < chunks.Length; i++)
{
    foreach (string s in chunks[i])
    {
        var data = s.Split(';'); 
        if (data[1] == "REQUEST")
        {
            cdb.Insert(data[0], data[2], data[3], data[4], data[5], dateLog);
        }
        else if (data[1] == "RESPONSE")
        {
            cdb.Update(data[0], data[5]);
        }
    }
}
An3SC