
ORIGINAL QUESTION:

I have some code which looks like this:

for (int i = start_i; i <= i_s; i++) 
{
    var json2 = JObject.Parse(RequestServer("query_2", new List<JToken>(){json1["result"]}));

    foreach (var data_1 in json2["result"]["data_1"])
    {
        var json3 = JObject.Parse(RequestServer("query_3", new List<JToken>(){data_1, 1}));

        foreach (var data_2 in json3["result"]["data_2"])
        {
            var id = data_2["id"];
            var index = data_2["other"];
        }

        foreach (var other in json3["result"]["other"])
        {
            var data_3_1_list = other["data_3"]["data_3_1"];
            var data_4 = other["data_4"];
            var data_5 = other["data_5"];

            foreach (var data_3_1 in data_3_1_list)
            {
                //Console.WriteLine(data_3_1); <- very fast
                insert_data((string)data_3_1); // <- very slow
            }
        }
    }
}

This code was able to generate about 5000 WriteLines in less than a minute. However, I now want to insert that data into a database. When I try to do that, the code takes much, much longer to get through the same 5000 sets of data.

My question is, how do I batch the database inserts into about 1000 inserts at a time, instead of doing one at a time? I have tried creating the insert statement using a StringBuilder, which is fine; what I can't figure out is how to generate 1000 at a time. I have tried using for loops up to 1000 and then trying to break out of the foreach loop before starting the next 1000, but it just makes a big mess.

I have looked at questions like this example, but they don't fit my loop scenario. I know how to do bulk inserts at the SQL level; I just can't seem to figure out how to generate the bulk SQL inserts from the specific nested loops in the code above.

The 5000 records were just a test run. The end code will have to deal with millions, if not billions, of inserts. Based on rough calculations, the end result will use about 500 GB of drive space when inserted into a database, so I will need to batch an optimal amount into RAM before inserting into the database.


UPDATE 1:

This is what happens in insert_data:

public static string insert_data(string data_3_1)
{
    string str_conn = @"server=localhost;port=3306;uid=username;password=password;database=database";

    MySqlConnection conn = null;

    conn = new MySqlConnection(str_conn);
    conn.Open();

    MySqlCommand cmd = new MySqlCommand();
    cmd.Connection = conn;
    cmd.CommandText = "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)";
    cmd.Prepare();

    cmd.Parameters.AddWithValue("@data_3_1", data_3_1);
    cmd.ExecuteNonQuery();
    cmd.Parameters.Clear();

    return null;
}
oshirowanen
    What is the definition of `insert_data()`, seeing that is the only way anyone can help you – maccettura Aug 14 '20 at 14:58
  • Will update question with `insert_data()` as soon as I get home. But basically, all it does is insert the given data into a database. – oshirowanen Aug 14 '20 at 15:02
    I told you yesterday, stage the data to insert in a list and once you're done processing the JSON response, do the bulk inserts grabbing 1000 items from the list at a time until the list is empty. – itsme86 Aug 14 '20 at 15:07
  • @itsme86, I get the feeling your solution will not work for the end product, as the end product will have to deal with a lot more records than 5000, so I think I will run out of RAM using the list approach. I guess a list could work for 5000 records in memory all at once, but I will need the solution to be more memory friendly. Once inserted into the database, my rough calculations suggest the data will use up around 500GB of drive storage. I do not have that much memory to hold 500GB in RAM before batching small sections of the list into the database. – oshirowanen Aug 14 '20 at 15:29
  • @maccettura, I have now added details about `insert_data()` as an update to the question above. – oshirowanen Aug 14 '20 at 19:31
  • Consider the technique of Rick James described in this URL. Rick calls it staging. http://mysql.rjweb.org/doc.php/staging_table – Wilson Hauck Aug 15 '20 at 18:22

3 Answers


There's a nice and popular NuGet package called MoreLINQ that offers an extension method Batch(int batchSize). To get IEnumerables containing up to 1000 elements each:

foreach (var upTo1000 in other["data_3"]["data_3_1"].Batch(1000))
{
    // Build a query using the (up to) 1000 elements in upTo1000
}
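
To turn each batch into a single multi-row INSERT, a minimal sketch (assuming the table and column names from the question, database_table and data_3_1, an already-open MySqlConnection named conn, and using directives for System.Linq, MoreLinq and MySql.Data.MySqlClient) could look like this:

foreach (var batch in other["data_3"]["data_3_1"].Batch(1000))
{
    var values = batch.ToList();

    // Build one INSERT with a placeholder per row:
    // INSERT INTO database_table (data_3_1) VALUES (@p0), (@p1), ...
    var placeholders = string.Join(", ", values.Select((_, i) => $"(@p{i})"));

    using (var cmd = new MySqlCommand(
        $"INSERT INTO database_table (data_3_1) VALUES {placeholders}", conn))
    {
        for (int i = 0; i < values.Count; i++)
            cmd.Parameters.AddWithValue($"@p{i}", (string)values[i]);

        cmd.ExecuteNonQuery(); // one round trip per batch instead of one per row
    }
}

MySQL accepts a single INSERT with many VALUES groups, so each batch costs one round trip instead of up to 1000.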
Eric J.

You're correct that doing bulk inserts in batches can be a big throughput win. Here's why it's a win: When you do INSERT operations one at a time, the database server does an implicit COMMIT operation after every insert, and that can be slow. So, if you can wrap every hundred or so INSERTs in a single transaction, you'll reduce that overhead.

Here's an outline of how to do that. I'll try to put it in the context of your code, but you didn't show your MySqlConnection object or query objects, so this solution of mine will necessarily be incomplete.

var batchSize = 100;
var batchCounter = batchSize;
var beginBatch = new MySqlCommand("START TRANSACTION;", conn);
var endBatch = new MySqlCommand("COMMIT;", conn);
beginBatch.ExecuteNonQuery();

for (int i = start_i; i <= i_s; i++) 
{
    // ...
    foreach (var data_1 in json2["result"]["data_1"])
    {
        // ...
        foreach (var other in json3["result"]["other"])
        {
            // ...
            foreach (var data_3_1 in other["data_3"]["data_3_1"])
            {
                //Console.WriteLine(data_3_1); <- very fast
                /****************** batch handling  **********************/
                if ( --batchCounter <= 0) {
                    /* commit one batch, start the next */
                    endBatch.ExecuteNonQuery();
                    beginBatch.ExecuteNonQuery();
                    batchCounter = batchSize;
                }                   
                insert_data((string)data_3_1); // <- very slow
            }
        }
    }
}
/* commit the last batch. It's OK if it contains no records */
endBatch.ExecuteNonQuery();

If you want, you can try different values of batchSize to find a good value. But generally something like the 100 I suggest works well.

Batch sizes of 1000 are also OK. But the larger each transaction gets, the more server RAM it uses before it's committed, and the longer it might block other programs using the same MySQL server.

O. Jones
    I have now added details about `insert_data()` as an update to the question above. Does this change anything? – oshirowanen Aug 14 '20 at 19:31
  • Ah, you are reopening your program's MySql connection for every single row you insert. No wonder it's slow. Open the connection just once, and use it for all your rows. And, you can also Prepare your statement just once, and reuse it for every row. – O. Jones Aug 15 '20 at 12:34
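
Putting that comment together with the transaction batching above, a minimal sketch (reusing str_conn and the table/column names from the question, with a hypothetical GetAllValues() standing in for the strings produced by the JSON loops) could look like this:

using (var conn = new MySqlConnection(str_conn))
{
    conn.Open();

    using (var cmd = new MySqlCommand(
        "INSERT INTO database_table (data_3_1) VALUES (@data_3_1)", conn))
    {
        cmd.Parameters.Add("@data_3_1", MySqlDbType.VarChar);
        cmd.Prepare();                        // prepare once, reuse for every row

        var batchSize = 1000;
        var inBatch = 0;
        var tx = conn.BeginTransaction();
        cmd.Transaction = tx;

        foreach (var value in GetAllValues()) // hypothetical: yields the strings from the JSON loops
        {
            cmd.Parameters["@data_3_1"].Value = value;
            cmd.ExecuteNonQuery();

            if (++inBatch >= batchSize)       // commit every batchSize rows
            {
                tx.Commit();
                tx = conn.BeginTransaction();
                cmd.Transaction = tx;
                inBatch = 0;
            }
        }

        tx.Commit();                          // commit the final partial batch
    }
}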

The best approach for me was using the LOAD DATA LOCAL INFILE statement. To make it work, you first have to turn on the MySQL server parameter local_infile. I used the mysql2 package for Node.js and its query function:

db.query({
   sql: "LOAD DATA LOCAL INFILE .......",
   infileStreamFactory: <readable stream which provides your data in flat file format>
}, function(err, results) {....});

The trick is to provide the readable stream properly. By default, LOAD DATA expects a tab-delimited text file. LOAD DATA also expects a file name, but in your case, if you provide a stream, the file name can be an arbitrary string.
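
In the C# context of the question, the same idea is available through Connector/NET's MySqlBulkLoader, which wraps LOAD DATA [LOCAL] INFILE. A minimal sketch, assuming an already-open MySqlConnection named conn, the question's table and column name, and an IEnumerable<string> named values holding the rows (the temporary-file approach is just one way to feed it):

var tempFile = Path.GetTempFileName();          // needs using System.IO;
File.WriteAllLines(tempFile, values);           // one value per line; assumes no tabs or newlines in the data

var loader = new MySqlBulkLoader(conn)
{
    TableName = "database_table",
    FileName = tempFile,
    Local = true,                               // stream the file from the client
    LineTerminator = "\n"
};
loader.Columns.Add("data_3_1");

int inserted = loader.Load();                   // one LOAD DATA statement for the whole file
File.Delete(tempFile);

Depending on the Connector/NET version, you may also need AllowLoadLocalInfile=true in the connection string in addition to enabling local_infile on the server.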

slkorolev