
I am trying to achieve the following:

  • Get the data from a SQL database.
  • Pass each record to the PerformStuff method, which calls a third-party method, MethodforResponse (it checks the input and provides a response).

  • Save the response (XML) back to the SQL database.

Below is the sample code. Performance-wise it is not good; if there are 1,000,000 records in the DB it is very slow.

Is there a better way of doing it? Any ideas or hints to make it better?

Please help.

using thirdpartylib;
 public class Program
    {

        static void Main(string[] args)
        {
            var response = PerformStuff();
            Save(response);


        }

        public class TestRequest
        {
            public int col1 { get; set; }
            public bool col2 { get; set; }
            public string col3 { get; set; }
            public bool col4 { get; set; }

            public string col5 { get; set; }
            public bool col6 { get; set; }
            public string col7 { get; set; }

        }
        public class TestResponse
        {
            public int col1 { get; set; }
            public string col2 { get; set; }
            public string col3 { get; set; }
            public int col4 { get; set; }

        }
        public static TestRequest GetDataId(int id)
        {
            TestRequest testReq = null;
            try
            {
                SqlCommand cmd = DB.GetSqlCommand("proc_name");
                cmd.AddInSqlParam("@Id", SqlDbType.Int, id);
                SqlDataReader dr = DB.GetDataReader(cmd);
                while (dr.Read())
                {
                    testReq = new TestRequest();

                    testReq.col1 = dr.GetInt32("col1");
                    testReq.col2 = dr.GetBoolean("col2");
                    testReq.col3 = dr.GetString("col3");
                    testReq.col4 = dr.GetBoolean("col4");
                    testReq.col5 = dr.GetString("col5");
                    testReq.col6 = dr.GetBoolean("col6");
                    testReq.col7 = dr.GetString("col7");



                }
                dr.Close();
            }

            catch (Exception ex)
            {
                throw;
            }
            return testReq;

        }
        public static TestResponse PerformStuff()
        {
            var response = new TestResponse();
            //give ids in list
            var ids = thirdpartylib.Methodforid();


            TestRequest request = null;
            foreach (int id in ids)
            {
                request = GetDataId(id);

                var output = thirdpartylib.MethodforResponse(request);

                foreach (var data in output.Elements())
                {
                    response.col4 = Convert.ToInt32(data.Id().Class());
                    response.col2 = data.Id().Name().ToString();
                }
            }
            //details from the last request processed
            response.col1 = request.col1;
            response.col2 = request.col2.ToString();
            response.col3 = request.col3;

            return response;
        }

        public static void Save(TestResponse response)
        {

            var Sb = new StringBuilder();
            try
            {
                Sb.Append("<ROOT>");
                Sb.Append("<id");
                Sb.Append(" col1='" + response.col1 + "'");
                Sb.Append(" col2='" + response.col2 + "'");
                Sb.Append(" col3='" + response.col3 + "'");
                Sb.Append(" col4='" + response.col4 + "'");

                Sb.Append("/>");
                Sb.Append("</ROOT>");
                var cmd = DB.GetSqlCommand("saveproc");
                cmd.AddInSqlParam("@Data", SqlDbType.VarChar, Sb.ToString());
                DB.ExecuteNoQuery(cmd);

            }
            catch (Exception ex)
            {

                throw;
            }
        }

    }

Thanks!

O. Jones
Harshit
  • Is it necessary to save the XML into the DB? – Darjan Bogdan Jan 11 '19 at 19:35
  • @DarjanBogdan, actually it's not. – Harshit Jan 11 '19 at 19:36
  • I would suggest splitting the records up into smaller chunks. The stored procedure should emit only a subset and offer the client some kind of paging. For example, the client could request records from a particular range (0 to 500). In a second result set the procedure can tell the caller how many records are still left. – Andi Kleinbichler Jan 11 '19 at 19:41
  • The only way I can think of to speed things up is if the third-party function would accept multiple request objects per invocation. Otherwise you are stuck with sequential operation. You could try to spawn multiple worker threads so multiple records may be processed concurrently. – Leng Jan 11 '19 at 19:44
  • From reading your sample code, it looks like you read many records and write just one. Your PerformStuff method gets called once and returns just one TestResponse object. Is that true? – O. Jones Jan 11 '19 at 19:51
  • @Leng yes, it also accepts a List. – Harshit Jan 11 '19 at 19:54
  • @jones yes, it's true. – Harshit Jan 11 '19 at 19:55

4 Answers


Your question is very broad, and the method PerformStuff() will be fundamentally slow because it performs O(n) database lookups, one per id, before producing its output. So, to me it seems you're going about this problem the wrong way.

Database query languages are made to optimize data traversal. So iterating by id, and then checking values, goes around this, producing the slowest lookup time possible.

Instead, leverage SQL's powerful query language and use clauses like where id < 10 and value > 100 because you ultimately want to limit the size of the data set needed to be processed by C#.

So:

  1. Read just the smallest amount of data you need from the DB.
  2. Process this data as a unit; parallelism might help.
  3. Write back the modifications in one DB connection.

Hope this sets you in the right direction.
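
Step 2 above can be sketched with Parallel.ForEach once a chunk of records is already in memory. This is only a sketch: ProcessOne is a stand-in for the third-party call (thirdpartylib.MethodforResponse in the question), and the chunk is assumed to have been loaded by a single filtered query.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelChunkDemo
{
    // Stand-in for the third-party per-record call (an assumption,
    // not the real thirdpartylib API).
    static int ProcessOne(int id) => id * 2;

    // Process one already-loaded chunk of ids in parallel; each record
    // is independent, so no ordering is required during processing.
    public static List<int> ProcessChunk(IEnumerable<int> ids)
    {
        var results = new ConcurrentBag<int>();
        Parallel.ForEach(ids, id => results.Add(ProcessOne(id)));
        return results.OrderBy(x => x).ToList();
    }
}
```

The real version would collect TestResponse objects instead of ints, then write the whole batch back in one connection (step 3).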

Loufs

I think the root of your problem is that you get and insert the data record by record. There is no way to optimize that; you need to change the approach in general.

You should think of a solution that:

  1. Gets all the data in one command to the database.
  2. Processes it.
  3. Saves it back to the database in one command, using a technique like BULK INSERT. Please be aware that BULK INSERT has certain limitations, so read the documentation carefully.
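
The batch-save half of this idea can be sketched by collecting every response into a single XML document and sending it to the database in one command (or handing the rows to SqlBulkCopy instead). The tuple shape and element names below are assumptions for illustration, not the question's real schema:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BatchSaveDemo
{
    // Build one XML document for the whole batch so the database is
    // hit once, instead of once per record as in the original Save().
    public static string BuildBatchXml(IEnumerable<(int col1, string col2)> responses)
    {
        var root = new XElement("ROOT",
            responses.Select(r => new XElement("id",
                new XAttribute("col1", r.col1),
                new XAttribute("col2", r.col2))));
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Using XElement also escapes attribute values automatically, which the original string concatenation does not.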

Nick

Based on your comment, there are multiple things you can enhance in your solution, from memory consumption to CPU usage.

  1. Take advantage of paging at the database level. Do not fetch all records at once; to avoid high memory consumption with 1+ million records, take them chunk by chunk and do whatever you need to do with each chunk.

  2. Since you don't need to save the XML into a database, you can choose to save the response to a file instead. Saving the XML to a file gives you an opportunity to stream the data onto your local disk.

  3. Instead of assembling the XML yourself, use XmlSerializer to do that job for you. XmlSerializer works nicely with XmlWriter, which in turn can work with any stream, including FileStream. There is a thread about it which you can take as an example.

To conclude, the PerformStuff method won't only be faster, it will require far fewer resources (memory, CPU), and most importantly, you'll easily be able to constrain the resource consumption of your program (by changing the database page size).
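
Point 3 can be sketched as follows. The ResponseRecord class below mirrors the shape of the question's TestResponse (its name and attributes are assumptions); in production the XmlWriter would wrap a FileStream instead of a StringBuilder:

```csharp
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Stand-in for TestResponse; [XmlAttribute] maps each property to an
// XML attribute, matching the hand-built XML in the question's Save().
public class ResponseRecord
{
    [XmlAttribute] public int col1 { get; set; }
    [XmlAttribute] public string col2 { get; set; }
}

public static class XmlSaveDemo
{
    // Let XmlSerializer produce the XML instead of string concatenation;
    // the same code works against a FileStream for streaming to disk.
    public static string Serialize(ResponseRecord record)
    {
        var serializer = new XmlSerializer(typeof(ResponseRecord));
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            serializer.Serialize(writer, record);
        }
        return sb.ToString();
    }
}
```

Besides being less error-prone, this also escapes attribute values correctly, which the StringBuilder approach in the question does not.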

Darjan Bogdan

Observation: your requirement looks like it matches the map / reduce pattern.

If the values in your ids collection returned by thirdpartylib.Methodforid() are reasonably dense, and the table behind your proc_name stored procedure has close to the same number of rows as the ids collection has items, you can retrieve all the records you need with a single SQL query (and a many-row result set) rather than retrieving them one by one. That might look something like this:

public static TestResponse PerformStuff()
{
    var response = new TestResponse();

    var idHash = new HashSet<int> (thirdpartylib.Methodforid());

    SqlCommand cmd = DB.GetSqlCommand("proc_name_for_all_ids");
    using (SqlDataReader dr = DB.GetDataReader(cmd)) {
        while (dr.Read()) {
            var id = dr.GetInt32("id");
            if (idHash.Contains(id)) {
                var request = new TestRequest();

                request.col1 = dr.GetInt32("col1");
                request.col2 = dr.GetBoolean("col2");
                request.col3 = dr.GetString("col3");
                request.col4 = dr.GetBoolean("col4");
                request.col5 = dr.GetString("col5");
                request.col6 = dr.GetBoolean("col6");
                request.col7 = dr.GetString("col7");

                var output = thirdpartylib.MethodforResponse(request);
                foreach (var data in output.Elements())  {
                    response.col4 = Convert.ToInt32(data.Id().Class());
                    response.col2 = data.Id().Name().ToString();
                }
            } /* end if hash.Contains(id) */  
        }  /* end while dr.Read() */
    } /* end using() */
    return response;
}

Why might this be faster? It makes one database query instead of one per id, and streams in the multiple rows of data to process. This will be far more efficient than your example.

Why might it not work?

  1. If the id values must be processed in the same order produced by thirdpartylib.Methodforid(), it won't work.
  2. If there's no way to retrieve all the rows, that is, no proc_name_for_all_ids stored procedure is available, you won't be able to stream the rows.
O. Jones