I have an API that needs to return large lists of JSON data. I'm trying to stream it directly from the database to the client, to avoid hogging RAM on my web server. Would this be a good way to do it? (Seems to be working)

[HttpGet]
[Route("data")]
public IEnumerable<MappedDataDto> GetTestData()
{
    var connection = new NpgsqlConnection(_connectionString);
    connection.Open();

    // buffered: false makes Dapper yield rows as they are read,
    // instead of loading the entire result set into memory first
    IEnumerable<RawDataDto> rawItems = connection.Query<RawDataDto>("SELECT * FROM sometable", buffered: false);

    foreach (var rawItem in rawItems)
    {
        var mappedItem = Map(rawItem);
        yield return mappedItem;
    }
}
  • Do I need to dispose the connection, or will that automatically be taken care of? I can't simply wrap it in a using block, since that throws an exception like "can't access a disposed object" (see the sketch in the edit below)
  • Would it be better to use some kind of stream instead of yield return?

EDIT: In my case we have a large legacy JavaScript web app that shows graphs and charts of data between two dates. The client downloads the entire dataset and does in-memory calculations on the client to represent that data in different ways (the users can create their own custom dashboards, so not everyone uses the data in the same way).

Our problem is that when clients request a period with a lot of data, the memory consumption of our ASP.NET Core API increases a lot. So this is what we are trying to prevent.

Unfortunately, making any larger changes to the client would be very time consuming, so we are looking at what we can do on the API side instead.

So this is why I'm trying to figure out if there is a way to stream the data from the database through the API. That way no changes would be needed on the client, and the memory consumption of the API would not be as bad, since it won't have to hold everything in memory.
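
EDIT 2: As mentioned in the comments, wrapping the using block around the yield return loop avoids the "disposed object" error, since the connection is only disposed once the framework has finished iterating the results. A minimal sketch of that variation (simplified from my real code):

[HttpGet]
[Route("data")]
public IEnumerable<MappedDataDto> GetTestData()
{
    // The using block is not exited until the iterator itself is
    // disposed, i.e. after the response has been fully enumerated
    using (var connection = new NpgsqlConnection(_connectionString))
    {
        connection.Open();

        var rawItems = connection.Query<RawDataDto>("SELECT * FROM sometable", buffered: false);

        foreach (var rawItem in rawItems)
        {
            yield return Map(rawItem);
        }
    }
}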

Joel
  • You should always dispose of unmanaged objects when they implement IDisposable. This would be automatic if you wrapped the connection in a *using* block. The answer to "would this be a good way to do it" seems subjective, but it's definitely one way to do it. – h0r53 Jun 13 '18 at 13:54
  • @CaitLANJenner Sorry, forgot to mention that if I do use a `using` block, I get an exception that I'm trying to access a disposed object (when the framework is iterating over the results) – Joel Jun 13 '18 at 13:56
  • @DanWilson on the contrary, using `using` here will result in an exception because the object gets disposed *before* serialization can even start. *This* code can't work as it is – Panagiotis Kanavos Jun 13 '18 at 13:58
  • @DanWilson Not trying to get an opinion :) I'm trying to find out if it would be possible to do it in a more efficient way, like using a stream for example – Joel Jun 13 '18 at 13:58
  • @Joel your code doesn't even run. You have to make it work *first*, then try to optimize it – Panagiotis Kanavos Jun 13 '18 at 13:59
  • @PanagiotisKanavos Doesn't it? Realized there is a typo, I will fix that. But I have a running example that I've simplified as much as possible – Joel Jun 13 '18 at 14:00
  • To answer the question of "will that automatically be taken care of" - yes, garbage collection will eventually do its job. However, generally you should not rely solely on garbage collection to clean your mess. There could be performance implications otherwise. – h0r53 Jun 13 '18 at 14:02
  • Also, it seems like the thing to dispose here would be the connection itself, not the rawItems data. You should still be able to dispose the connection and then iterate the loop. Just declare IEnumerable outside the using block - or manually call connection.Dispose() – h0r53 Jun 13 '18 at 14:05
  • @CaitLANJenner That isn't correct. But I realized that I can wrap the `using` around the `yield return` to avoid the "disposed" error. – Joel Jun 13 '18 at 14:09
  • How large are the lists of data to send? I'd recommend implementing paging and letting the client application request the data as needed. It depends on your requirements though; would you mind explaining your functionality? – Rui Jarimba Jun 13 '18 at 14:14
  • @RuiJarimba Depends. The caller can request data between two dates. Can be 1 data point, can be 1 million. Paging isn't really an option since we'd like to minimize the number of times the sql query is run (and other factors as well) – Joel Jun 13 '18 at 14:19
  • @RuiJarimba is right. Paging is the appropriate path. The likely reason you want to limit the times the SQL Query is run is because it takes a long time and/or is resource heavy. Paging partially solves that issue, as you don't need to query all the data at once. Aside from that, you should be implementing a stored procedure and/or finding other ways to optimize the query on the database end. Regardless, you can't return JSON and stream at the same time. All the data will have to be loaded into memory *eventually* to serialize it to a JSON string. – Chris Pratt Jun 13 '18 at 14:39
  • @ChrisPratt Hm ok. But how come I see a spike in memory usage if I use `buffered: true` for the Dapper query, but not when I use `buffered: false`. I thought that was an indication that the data was mapped to JSON and returned in chunks somehow. I don't really know how this works, hence the question :) – Joel Jun 13 '18 at 14:42
  • @Joel could you please edit your question and describe your scenario? There is a massive difference between requesting data to be displayed in a website, mobile app etc. and requesting data for other purposes - e.g. reporting, a background job that will process each and every object, etc. – Rui Jarimba Jun 13 '18 at 15:05
  • You've got multiple moving parts here, so memory utilization will vary based upon how each piece does its thing. However *true* streaming means sending `Transfer-Encoding: chunked` and actually feeding the client bytes as they're available. That will never happen here. ASP.NET Core will always need to fully materialize the query results in order to create the JSON response. – Chris Pratt Jun 13 '18 at 15:24
  • @RuiJarimba I added some extra info about why I'm trying to do what I do. – Joel Jun 14 '18 at 06:39
  • @Joel if the data is being displayed on graphs/charts maybe you should try to group the data somehow - daily, weekly, monthly (whatever) in order to reduce the amount of data that is being sent to the client application. For example, instead of 1 million records that correspond to a 30 days period, group the data by day, what do you think? – Rui Jarimba Jun 14 '18 at 10:39
  • @RuiJarimba That is one of the options I have further down my list of things to try. It's further down on the list because I would have to rewrite a lot of logic on the client side. – Joel Jun 14 '18 at 10:54

1 Answer


Given the functionality (displaying data in charts/graphs) I'd suggest some changes to both the client and the server application.

Let's assume this scenario:

A client requests data for a period of 30 days, which corresponds to 1 million rows

This means high memory consumption not only on the server but also in the client application! So I'd suggest rewriting the query to group the data by day, hour or even a shorter time period, whatever suits your needs - this would reduce the amount of data being sent from the server (see the sketch after the list below):

  • grouping by day: 30 records
  • grouping by hour: 720 records
  • grouping by 10 minutes range: 4320 records
  • grouping by minute: 43200 records
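
Since the question's code uses NpgsqlConnection, here is a minimal sketch of what a grouped endpoint could look like with PostgreSQL's date_trunc and Dapper, grouping by hour. The DTO, the route, and the column names (measured_at, value) are made-up assumptions for illustration; adjust them to the real schema:

// Hypothetical DTO for the grouped result
public class AggregatedDataDto
{
    public DateTime Bucket { get; set; }
    public double AvgValue { get; set; }
    public long Count { get; set; }
}

[HttpGet]
[Route("aggregated")]
public IEnumerable<AggregatedDataDto> GetAggregatedData(DateTime fromDate, DateTime toDate)
{
    // Aggregation happens on the database side, so only one row per
    // hour travels over the wire instead of every individual data point
    const string sql = @"
        SELECT date_trunc('hour', measured_at) AS bucket,
               avg(value)                      AS avgvalue,
               count(*)                        AS count
        FROM sometable
        WHERE measured_at BETWEEN @From AND @To
        GROUP BY 1
        ORDER BY 1";

    using (var connection = new NpgsqlConnection(_connectionString))
    {
        // The grouped result is small, so it's fine to materialize it
        // fully (ToList) before the connection is disposed
        return connection.Query<AggregatedDataDto>(sql, new { From = fromDate, To = toDate }).ToList();
    }
}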

The client application would obviously need some changes in order to do its calculations based on the grouped data rather than on each individual row.

BTW I don't know which RDBMS you're using, but this might be helpful (SQL Server): How to group time by hour or by 10 minutes

Rui Jarimba