
I am requesting data from a Products API, but the thing is that I can only get it 20 records at a time. The endpoint looks like this:

https://www.someDummyAPI.com/Api/Products?offset=0&count=20

Note: I can't change the count, it will always be 20.

I.e. the data from this endpoint will contain 20 records, from 0 to 20, and after that I have to increase the offset by 20 to get the next 20 records, and so on (in total it's about 1,500 records, so I have to make approximately 75 requests).

After getting all the data I insert it into the SQL database using a stored procedure (this is a different process).

So my question is: how can I speed up the fetching process? I thought about running tasks in parallel, but I need to get the results from each response.

For now this process looks like this:

    protected async void FSL_Sync_btn_Click(object sender, EventArgs e)
    {
        int offset = 0;
        int total = 0;
        bool isFirst = true;
        DataTable resTbl = CreateDt();
        while (offset < total || offset == 0)
        {
            try
            {
                var data = await GetFSLData(offset.ToString(),"Products");

                JObject Jresult = JObject.Parse(data);

                if (isFirst)
                {
                    Int32.TryParse(Jresult.SelectToken("total").ToString(),out total);
                    isFirst = false;
                }
                // Function to chain up data in DataTable
                resTbl = WriteInDataTable(resTbl, Jresult);

                offset += 20;
            }
            catch (Exception ex)
            {
                var msg = ex.Message;
            }
        }
    }

So the process flow I am taking is:

  1. Get data from the API (let's say the first 20 records).
  2. Add it to the existing DataTable using the WriteInDataTable function.
  3. Insert data into the SQL database from this resTbl DataTable (completely different process, not shown in this snippet).

I haven't used parallel tasks yet (don't even know if it's a correct solution for this), so I would appreciate any help.

Theodor Zoulias
JmukhadzeT
  • So I went to copy and paste your code to modify and answer your question... That worked out well didn't it? – TheGeneral Nov 16 '21 at 07:25
  • 1. Measure, 2. Pinpoint your bottleneck(s), then go look how to fix them. – Fildor Nov 16 '21 at 07:26
  • First thing that stings my eye is that you are waiting for the DB write to finish before requesting the next batch. You could separate this and speed the process up. By how much depends on your db / connection speed. – Fildor Nov 16 '21 at 07:29
  • BTW: Using `TryParse` and ignoring its result is ... awkward. – Fildor Nov 16 '21 at 07:31
  • @Fildor The WriteInDataTable function doesn't write data to the database (it just chains up the result data in the DataTable object, resTbl); that's a separate process. Also I had to delete some part of the code; I actually am checking the result of TryParse, but it doesn't matter for this question, which is why I removed it. – JmukhadzeT Nov 16 '21 at 07:34
  • Thereby you are creating false leads to the people trying to help you. If that part is irrelevant to the question, better leave it out completely. – Fildor Nov 16 '21 at 07:40
  • @Fildor This part isn't irrelevant; it's to show what I am using the response data for. It's the simplest version of the process I need to make. – JmukhadzeT Nov 16 '21 at 07:44
  • [Please do not upload images of code/errors](https://meta.stackoverflow.com/q/285551) but provide it as text – Klaus Gütter Nov 16 '21 at 07:45
  • Well, anyway. Did you _measure_ how much time `WriteInDataTable` consumes? – Fildor Nov 16 '21 at 07:47
  • Could you add your code in the question please :) – Tjaart Nov 16 '21 at 07:53
  • The fast solution would be to actually talk to the database instead of an HTTP API. If you can't do that, ask the DBA to export the data into a CSV you can import. If you can't do that, ask the service developers to prepare that CSV file in the background so you can download it later on. Only if you have no other alternative try to download rows 20 at a time. You're wasting 80% if not more of your time in roundtrips **AND wasting the server's CPU**. It will be a lot cheaper for everyone involved if they just exported a CSV file. 1500 records is no data at all – Panagiotis Kanavos Nov 16 '21 at 08:03
  • ^^ Great idea. Just to add: If CSV is not the optimal solution, XML or JSON would probably be just as good as CSV while providing different pros/cons. – Fildor Nov 16 '21 at 08:10
  • Why are reading and inserting serialized? You want to insert as soon as you've downloaded a page worth of products. – CodeCaster Nov 16 '21 at 08:19
  • @CodeCaster I am inserting data into the database after getting all of it. The WriteInDataTable function just adds the result to the existing DataTable (resTbl) after each call. – JmukhadzeT Nov 16 '21 at 08:25
  • I am asking why you don't insert each page as soon as you've downloaded it. – CodeCaster Nov 16 '21 at 08:39
  • @CodeCaster Idk really, I thought this way it would be faster. – JmukhadzeT Nov 16 '21 at 09:15
  • If you have one thread downloading data, another thread could start inserting data as it comes in. – CodeCaster Nov 16 '21 at 09:15
  • @CodeCaster Sure, I will try that too, thanks. – JmukhadzeT Nov 16 '21 at 09:21
  • You wrote: „…will contain 20 record, from 0 to 20”. Not to 20, but to 19… ;) – Eru Oct 22 '22 at 03:41
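The download-one-thread / insert-another-thread idea from the comments could be sketched with `System.Threading.Channels`. This is only an illustration: `FetchPageAsync` is a hypothetical stand-in for `GetFSLData`, and the "insert" is simulated rather than calling the real stored procedure.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Simulated page fetch: a stand-in for the real GetFSLData call.
static async Task<string> FetchPageAsync(int offset)
{
    await Task.Delay(50); // pretend network latency
    return $"page@{offset}";
}

var channel = Channel.CreateBounded<string>(5);

// Producer: downloads pages and hands them off as soon as they arrive.
var producer = Task.Run(async () =>
{
    for (int offset = 0; offset < 100; offset += 20)
        await channel.Writer.WriteAsync(await FetchPageAsync(offset));
    channel.Writer.Complete();
});

// Consumer: "inserts" each page without waiting for the full download.
int inserted = 0;
await foreach (var page in channel.Reader.ReadAllAsync())
{
    inserted++; // real code would run the stored procedure here
}
await producer;
Console.WriteLine($"pages inserted: {inserted}"); // → pages inserted: 5
```

The bounded capacity keeps memory flat: if the database insert is the slow side, the downloader simply waits until a slot frees up.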

4 Answers


Get your first page and set the total before the loop:

int offset = 0;
int total = 0;

var data = await GetFSLData(offset.ToString(), "Products");

JObject Jresult = JObject.Parse(data);

Int32.TryParse(Jresult.SelectToken("total").ToString(), out total);

In the next step you can then parallelize your tasks:

DataTable resTbl = CreateDt();
var downloadTasks = new List<Task<string>>();
while (offset < total)
{
    downloadTasks.Add(GetFSLData(offset.ToString(), "Products"));
    offset += 20;
}

Then you can use Task.WhenAll to await all the results:

var httpResults = await Task.WhenAll(downloadTasks);
foreach (var jObjectResult in httpResults.Select(JObject.Parse))
{
    resTbl = WriteInDataTable(resTbl, jObjectResult);
}

Just one thing to be aware of: you will be hitting that API with a lot of simultaneous requests, which might not be a good idea. You could use TransformBlock and ActionBlock from the TPL Dataflow library if you run into this problem. You can find more information on that here:

https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library
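A minimal Dataflow sketch of that idea (requires the System.Threading.Tasks.Dataflow NuGet package; the download is simulated with `Task.Delay`, so the offsets and payloads are placeholders, not the real API):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Download stage: up to 4 simulated requests in flight at once.
var download = new TransformBlock<int, string>(async offset =>
{
    await Task.Delay(20);              // pretend network latency
    return $"{{\"offset\":{offset}}}"; // placeholder JSON payload
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

// Store stage: runs sequentially by default, so no locking is needed
// when appending to a shared structure such as a DataTable.
int pages = 0;
var store = new ActionBlock<string>(json => pages++);

download.LinkTo(store, new DataflowLinkOptions { PropagateCompletion = true });

for (int offset = 0; offset < 200; offset += 20)
    download.Post(offset);
download.Complete();
await store.Completion;

Console.WriteLine($"pages processed: {pages}"); // → pages processed: 10
```

The `MaxDegreeOfParallelism` on the download block is the throttle: it caps how many requests hit the API at once while the store stage drains results as they complete.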

Tjaart

If you have upgraded to the .NET 6 platform, you could consider using the Parallel.ForEachAsync method to parallelize the GetFSLData invocations. This method requires an IEnumerable<T> sequence as source. You can create this sequence using LINQ (the Enumerable.Range method). To avoid any problems associated with the thread-safety of the DataTable class, you can store the JObject results in an intermediate ConcurrentQueue<JObject> collection, and defer the creation of the DataTable until all the data have been fetched and are locally available. You may need to also store the offset associated with each JObject, so that the results can be inserted in their original order. Putting everything together:

protected async void FSL_Sync_btn_Click(object sender, EventArgs e)
{
    int total = Int32.MaxValue;
    IEnumerable<int> offsets = Enumerable
        .Range(0, Int32.MaxValue)
        .Select(n => checked(n * 20))
        .TakeWhile(offset => offset < Volatile.Read(ref total));

    var options = new ParallelOptions() { MaxDegreeOfParallelism = 10 };
    var results = new ConcurrentQueue<(int Offset, JObject JResult)>();
    await Parallel.ForEachAsync(offsets, options, async (offset, ct) =>
    {
        string data = await GetFSLData(offset.ToString(), "Products");
        JObject Jresult = JObject.Parse(data);
        if (offset == 0)
        {
            Volatile.Write(ref total,
                Int32.Parse(Jresult.SelectToken("total").ToString()));
        }
        results.Enqueue((offset, Jresult));
    });

    DataTable resTbl = CreateDt();
    foreach (var (offset, Jresult) in results.OrderBy(entry => entry.Offset))
    {
        resTbl = WriteInDataTable(resTbl, Jresult);
    }
}

The Volatile.Read/Volatile.Write are required because the total variable might be accessed by multiple threads in parallel.

In order to get optimal performance, you may need to adjust the MaxDegreeOfParallelism configuration, according to the capabilities of the remote server and your internet connection.

Note: This solution is not efficient memory-wise, because it requires that all data are stored in memory in two different formats at the same time.
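The fetch-in-parallel-then-reorder step can be demonstrated in isolation with a simulated fetch (`Task.Delay` with random jitter stands in for `GetFSLData`, and the page count is fixed instead of being read from `total`):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Five fixed pages (offsets 0,20,...,80); real code would derive the
// count from the "total" field as shown in the answer above.
var offsets = Enumerable.Range(0, 5).Select(n => n * 20);

var results = new ConcurrentQueue<(int Offset, string Data)>();
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
await Parallel.ForEachAsync(offsets, options, async (offset, ct) =>
{
    await Task.Delay(Random.Shared.Next(5, 30), ct); // simulated latency
    results.Enqueue((offset, $"page@{offset}"));     // arrival order varies
});

// Restore the original order before building the DataTable.
var orderedCsv = string.Join(",",
    results.OrderBy(r => r.Offset).Select(r => r.Offset));
Console.WriteLine(orderedCsv); // → 0,20,40,60,80
```

Whatever order the pages complete in, the `OrderBy` on the recorded offsets puts them back in sequence, which is exactly why the offset is stored alongside each result.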

Theodor Zoulias

You could use Task.WhenAll to run your requests in parallel.

    public async Task<IEnumerable<string>> GetDataInParallel()
    {
        var tasks = new List<Task<string>>();
        var offset = 0;

        while (...) // loop until every offset up to the total has been queued
        {
            var dataTask = GetFastLaneData(offset.ToString(), "Products"); // starts the request immediately; the result is awaited later
            tasks.Add(dataTask);
            offset += 20;
        }

        var datas = await Task.WhenAll(tasks); // wait for all requests to complete

        return datas;
    }

This method will try to create or utilize 1000 threads and manage them, which might be harmful to performance, but will be significantly faster than launching them in order. You might consider batching them to achieve even better performance and launch like 100 tasks at a time.
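The batching idea could be sketched as follows (assuming .NET 6's `Enumerable.Chunk`; `FetchAsync` is a hypothetical stand-in for `GetFastLaneData`, with `Task.Delay` simulating the network call):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical fetch standing in for GetFastLaneData.
static async Task<string> FetchAsync(int offset)
{
    await Task.Delay(20); // pretend network latency
    return $"page@{offset}";
}

// ~1500 records at 20 per page => 75 offsets: 0, 20, ..., 1480.
var offsets = Enumerable.Range(0, 75).Select(n => n * 20);

int pages = 0;
foreach (var batch in offsets.Chunk(10)) // at most 10 requests in flight
{
    var batchResults = await Task.WhenAll(batch.Select(FetchAsync));
    pages += batchResults.Length;
}
Console.WriteLine($"fetched {pages} pages"); // → fetched 75 pages
```

Each batch is awaited before the next one starts, so the server never sees more than the chunk size of concurrent requests.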

xakpc
  • So, basically a small-scale DoS attack ... – Fildor Nov 16 '21 at 07:59
  • yeah, exactly :) but to be honest almost any server should handle 100-1000 requests per second easily. DDoS attacks all about millions RPS – xakpc Nov 16 '21 at 08:17
  • I wasn't exactly 100% serious ;) But I think you got it. And while I agree that servers _should_ be able to handle it, it may well be that API throttles are likely to kick in. But OP didn't mention any, so that's only a guess. – Fildor Nov 16 '21 at 08:26
  • *"This method will try to create or utilize 1000 threads"* I beg to differ. If the `GetFastLaneData` is a correctly implemented asynchronous method, it will use [zero threads](https://blog.stephencleary.com/2013/11/there-is-no-thread.html) while running. A few `ThreadPool` threads (potentially only one) will be utilized briefly when each `Task` completes, to run the continuation attached internally by the `Task.WhenAll` method. – Theodor Zoulias Nov 16 '21 at 10:55

It's quite hard to know what you're really using and getting, due to the high abstraction level in your code (which is IMHO good, but makes it quite hard to spot errors on a page like SO).

So here is just a sketch of how you can parallelize all requests to your API to improve the fetch time and write the results into the database once. Maybe there are quotas on the API and you might have to run these things in chunks, but this can easily be adapted through LINQ.

var httpClient = new HttpClient();
var requests = Enumerable.Range(0, 1500)
    .Where(i => i % 20 == 0)
    // Create all needed requests
    .Select(offset => $"https://www.someDummyAPI.com/Api/Products?offset={offset}&count=20")
    .Select(url => new HttpRequestMessage(HttpMethod.Get, url))
    // Create tasks to call these requests
    .Select(request => httpClient.SendAsync(request));

// Run all of these requests in parallel.
var responses = await Task.WhenAll(requests);
// Create all tasks to get the content out of the requests
var allContentStreams = responses
    .Select(response => response.Content.ReadAsStringAsync());

// Retrieve all content bodies as strings
var allRawContents = await Task.WhenAll(allContentStreams);
// Serialize strings into some usable object
var allData = allRawContents
    .Select(JsonConvert.DeserializeObject<MyDataDTO>);

// Add all objects to the database context.
foreach (var data in allData)
{
    WriteIntoDatabase(data);
}

// Let context persist data into database.
SaveDatabase();
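If API throttling does become a problem, a `SemaphoreSlim` can cap the number of in-flight requests while keeping a single `Task.WhenAll`. This is a sketch only: `SendAsync` simulates the `HttpClient` call with a delay, and the URLs are the same placeholders as above.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Simulated HTTP call standing in for httpClient.SendAsync + ReadAsStringAsync.
static async Task<string> SendAsync(string url)
{
    await Task.Delay(10); // pretend network latency
    return $"response for {url}";
}

var urls = Enumerable.Range(0, 1500)
    .Where(i => i % 20 == 0)
    .Select(offset => $"https://www.someDummyAPI.com/Api/Products?offset={offset}&count=20");

var gate = new SemaphoreSlim(10); // at most 10 concurrent requests
var tasks = urls.Select(async url =>
{
    await gate.WaitAsync();
    try { return await SendAsync(url); }
    finally { gate.Release(); }
});

var bodies = await Task.WhenAll(tasks);
Console.WriteLine(bodies.Length); // → 75
```

Unlike fixed-size batches, the semaphore releases a slot the moment any request finishes, so the pipeline stays fully utilized without ever exceeding the cap.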
Oliver
  • I tend to lean toward @PanagiotisKanavos comment. It seems the Http API is a huge bottleneck. So, maybe it's better to figure out if it is possible to switch to something with better throughput. - Which of course somewhat evades the problem in the question rather than solving it. But I honestly doubt it _is_ solvable to a satisfying extent. – Fildor Nov 16 '21 at 08:21