53

Dapper dot net has a `buffered` parameter (a bool), but as far as I can tell the only thing it does is cast the result to a list before returning it.

As per the documentation:

Dapper's default behavior is to execute your sql and buffer the entire reader on return. This is ideal in most cases as it minimizes shared locks in the db and cuts down on db network time.

However when executing huge queries you may need to minimize memory footprint and only load objects as needed. To do so, pass `buffered: false` into the Query method.

I'm not sure how casting the result to a list accomplishes this. Am I missing something? My only idea is that it is supposed to set the `CommandBehavior` for the `ExecuteReader` to `CommandBehavior.SequentialAccess` (but it doesn't).
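
(For reference, a minimal sketch of the two call shapes; the connection string, `Person` type, and `People` table below are placeholders, not anything from Dapper itself.)

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class BufferedDemo
{
    public static void Run(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // Default (buffered: true): the entire result set is read into a List<T>
            // before Query returns, so the reader/command are already finished here.
            IEnumerable<Person> buffered = connection.Query<Person>(
                "select Id, Name from People");

            // buffered: false returns a lazy sequence; rows are materialized one at a
            // time as you enumerate, and the reader stays open until you finish.
            IEnumerable<Person> streamed = connection.Query<Person>(
                "select Id, Name from People", buffered: false);

            foreach (var person in streamed)
            {
                // process each row without holding the full result set in memory
            }
        }
    }
}
```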

rethabile
  • 3,029
  • 6
  • 34
  • 68
smdrager
  • 7,327
  • 6
  • 39
  • 49
  • 1
    Duplicate of [Explanation of dapper buffer/cache](http://stackoverflow.com/questions/13026558/explanation-of-dapper-buffer-cache), which has a slightly longer, more detailed answer by @Marc (although, admittedly, it was actually asked after this question). – vgru Oct 21 '14 at 11:21
  • @Groo, if the other question was asked after this one, then the *other* question is the duplicate, not this one. – kristianp Feb 27 '19 at 00:50

3 Answers

55

but as far as I can tell the only thing it does is cast the result to a list before returning it

You aren't missing anything. That is the key difference. Except it isn't a cast as such: the actual returned object is very different. Basically, there are two ways of reading data:

  • in a streaming API each element is yielded individually; this is very memory efficient, but if you do lots of subsequent processing per item, it can mean that your connection / command is "active" for an extended time
  • in a buffered API all the rows are read before anything is yielded

If you are reading a very large amount of data (thousands to millions of rows), a non-buffered API may be preferable. Otherwise lots of memory is used, and there may be noticeable latency before even the first row is available. However, in most common scenarios the amount of data read is within reasonable limits, so it is reasonable to push it into a list before handing it to the caller. That means that the command / reader etc has completed before it returns.

As a side note, buffered mode also avoids the oh-so-common "there is already an open reader on the connection" (or whatever the exact phrasing is).
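
To make the distinction concrete, here is a simplified sketch of the two reading strategies (illustration only, not Dapper's actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class ReaderStrategies
{
    // Streaming: each row is yielded as it is read, so the command / reader stay
    // open for as long as the caller keeps enumerating.
    public static IEnumerable<T> Stream<T>(IDbCommand command, Func<IDataRecord, T> map)
    {
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                yield return map(reader);
            }
        }
    }

    // Buffered: every row is read up front into a list and the reader is closed
    // before anything is handed back to the caller.
    public static List<T> Buffer<T>(IDbCommand command, Func<IDataRecord, T> map)
    {
        return Stream(command, map).ToList();
    }
}
```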

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
9

I have to disagree with @chris-marisic on this... I ran into multiple "Out Of Memory" exceptions at that exact line (`data.ToList()`) when using `buffered: true`. It was not a "zillion rows X bazillion columns" query, just a regular 5-6k-row SQL result with about 30 columns.

It really depends on your configuration, e.g. whether your SQL Server and IIS run on the same physical machine or not, how much memory is installed on the IIS machine, what the page file settings are, etc. If the web server has 2 GB or less, consider setting `buffered: false` for super-heavy reports.
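
For example, a heavy report can be streamed out row by row instead of materialized in memory; this is only a sketch, and the `ReportRow` type and table name are made up:

```csharp
using System.Data.SqlClient;
using System.IO;
using Dapper;

public class ReportRow
{
    public int Id { get; set; }
    public decimal Amount { get; set; }
}

public static class ReportExport
{
    public static void WriteCsv(string connectionString, TextWriter output)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // buffered: false streams rows one at a time, so only the row currently
            // being written is held in memory instead of the whole result set.
            var rows = connection.Query<ReportRow>(
                "select Id, Amount from dbo.HeavyReport", buffered: false);

            output.WriteLine("Id,Amount");
            foreach (var row in rows)
            {
                output.WriteLine($"{row.Id},{row.Amount}");
            }
        }
    }
}
```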

jazzcat
  • 4,351
  • 5
  • 36
  • 37
  • To elaborate on that: `data.ToList()` creates a list fully in memory anyway. So if you are using `buffered: true`, you will probably get the list in memory twice. – frankhommers Feb 10 '17 at 22:18
  • By `data.ToList()` I was referencing an actual line from dapper's source code. – jazzcat May 06 '17 at 19:35
  • That's a textbook example of why pagination should be used, has nothing to do with buffered vs unbuffered. – Chris Marisic Jul 31 '23 at 18:36
2

In practice it is better to never use `buffered: false`.

I've found that even when reading many millions of rows it is both faster and more memory efficient to use buffered results than unbuffered. Perhaps there is a crossover point if your tables have 500 columns and you're reading tens or hundreds of millions of rows.

If your result sets are smaller than many billions of values it is not worth using `buffered: false` for any reason.

I was shocked during actual analysis that reading gigabytes of data from SQL Server was both faster (2-6x) and more memory efficient in standard buffered mode. The performance increase held even for the most minute per-row operation possible: adding an object by index to a pre-sized sparse array that never resizes. With a multi-gigabyte sparse array, switching from unbuffered to buffered gave the 2x improvement in load time. Writing to a dictionary, buffered gave the 6x improvement in load time when inserting millions of records (the dictionary used the table's int PK as the key, so the hashcode calculation was as basic as possible).

As with everything regarding performance, you must always analyze. However, I can tell you with a very high level of certainty: always start with Dapper's default buffered behavior.
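
If you want to check this against your own data, a rough timing sketch like the one below is one way to compare the two modes (the `Record` type and SQL text are placeholders; the results will depend entirely on your schema and hardware):

```csharp
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;
using Dapper;

public static class BufferingComparison
{
    public class Record
    {
        public int Id { get; set; }
        public string Payload { get; set; }
    }

    public static void Compare(string connectionString, string sql)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            var sw = Stopwatch.StartNew();
            // Buffered (default): Dapper materializes the whole result first,
            // then ToDictionary runs over an in-memory list.
            var buffered = connection.Query<Record>(sql).ToDictionary(x => x.Id);
            Console.WriteLine($"buffered:   {sw.Elapsed} ({buffered.Count} rows)");

            sw.Restart();
            // Unbuffered: rows are pulled from the open reader one at a time
            // while the dictionary is being populated.
            var unbuffered = connection.Query<Record>(sql, buffered: false).ToDictionary(x => x.Id);
            Console.WriteLine($"unbuffered: {sw.Elapsed} ({unbuffered.Count} rows)");
        }
    }
}
```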

Chris Marisic
  • 32,487
  • 24
  • 164
  • 258
  • 2
    Chris, when you talk about populating the array or writing to a dictionary, are you talking about how you're handling results that Dapper has already returned? So technically that's profiling your own post-buffered-read code? Or are you talking about processing results from SQL Server at a lower level? – Rich Jan 20 '16 at 16:20
  • @Rich Using `foreach(item in unbuffered) dictionary.Add(item.Id, item)` was 6x slower than `bufferedquery.ToDictionary(x => x.Id)`. There is a definitive latency cost for using unbuffered. (I'm reading sets of millions and hundreds of millions; note I also used the properly sized capacity constructor for the dictionary.) – Chris Marisic Jan 20 '16 at 18:08
  • So you are indeed talking about what I thought. Awesome. Very useful info, thanks. – Rich Jan 20 '16 at 19:08
  • 1
    Dunno what kind of workflow you are doing, but using `buffered = true` on millions of rows is very careless of resources. Deployed to limited resource machines, it may be very slow (due to paging) or just crash, for row size of typical business data. – Kasey Speakman Mar 31 '18 at 22:55
  • @KaseySpeakman and what if the business case is load a **b**illion records into memory? – Chris Marisic Apr 11 '18 at 18:24
  • @ChrisMarisic We use unbuffered for exporting data to CSV from low memory/cpu AWS server nodes. It could be a lot of data (wide rows), but the process still uses little memory. It is a great fit for that. – Kasey Speakman Apr 11 '18 at 19:08
  • @KaseySpeakman isn't there a sql server native way to export data to CSV? – Chris Marisic Apr 11 '18 at 19:14
  • @ChrisMarisic Perhaps, but I'm using Postgres and file gets saved directly to S3. – Kasey Speakman Apr 11 '18 at 22:33
  • @KaseySpeakman if you're writing the file to S3 aren't you buffering 100% of everything into memory already? I'm pretty sure S3 doesn't support block writing. – Chris Marisic Apr 12 '18 at 18:26
  • 3
    @ChrisMarisic We are using S3's [Multipart Upload](https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html) support. Where we upload pieces at a time and S3 combines them upon completion. Suffice it to say there are good use cases for both unbuffered and buffered Dapper queries. We use buffered for most API calls but we also limit result counts to avoid resource problems. – Kasey Speakman Apr 12 '18 at 18:36
  • 2
    @ChrisMarisic Ah, "my way or the highway." No worries. Onlookers can make their own judgements. – Kasey Speakman Apr 12 '18 at 21:10
  • Maybe things have changed since this answer was written, but the current Dapper code says: `return buffered ? results.ToList() : results;` And the `results` is `yield return`-ing of rows as they come in from the connection. What's your theory for how `d.ToList().ToDictionary() ` is 6x faster than `d.ToDictionary()`? – Daniel Earwicker Oct 10 '19 at 16:40
  • @DanielEarwicker It likely creates lots of intermediary collections over and over again as the unknown-length collection is being hydrated. It's also possible that there are a lot more context switches, as opposed to getting the I/O threads closed as finished ASAP with ToList. – Chris Marisic Oct 11 '19 at 14:48
  • @ChrisMarisic I'm pretty sure if you have data warehousing, or need to actually return billions of records, multiple times at the same time, your only option is to use unbuffered, unless of course you have 1 TB RAM. Everything exists for a reason, it's not something you just sign-off as being useless, or considered bad practice to be used. Obviously if you can buffer any data to the memory, it will be much faster than not buffering it, but sometimes you simply don't have the resources to do that, or considered foolish to do so. Everything can be abused. – SpiritBob May 15 '20 at 06:43
  • @SpiritBob for reference this was all measured against reading about 4 billion rows in several minutes, every few hours. And yes it was for a reporting engine to have access to all data in memory. Which if your data isn't large random strings, this is only a dozen or two dozen gigabytes of memory. Not 100 or 1000. – Chris Marisic May 22 '20 at 17:31
  • This is flat out bad advice. This completely depends on one's use case, and suggesting to never use `buffered: false` is honestly misleading. I have a query with ~8 million rows using Snowflake and the difference between this setting on or off is the difference between allocating 12 GB of memory vs 120 MB of memory. And there are very few columns (<15) on this particular query and they are all fairly small strings at worst. – julealgon Jul 21 '23 at 23:42
  • @julealgon that sounds like your query is materializing gigabytes of data for no purpose, `buffered: true` hit a home run here to show you that. – Chris Marisic Jul 31 '23 at 18:32