
I have a class that handles all database methods, including Entity Framework related stuff. When data is needed, other classes may invoke a method in this class such as

public List<LocalDataObject> GetData(int start, int end);

The database is queried using LINQ to Entities and the calling class can then iterate over the data. But since other classes have no access to the EF entities, I need to perform a `ToList()` operation on the query, thereby fetching the full dataset into memory.
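Roughly, what the method does today looks like this (the context, entity, and property names are simplified placeholders):

    // Simplified sketch of the current approach; MyEntities / DataEntities / Id / Name are placeholders.
    public List<LocalDataObject> GetData(int start, int end)
    {
        using (var context = new MyEntities())
        {
            return context.DataEntities
                          .Where(e => e.Id >= start && e.Id <= end)
                          .Select(e => new LocalDataObject { Id = e.Id, Name = e.Name })
                          .ToList(); // materializes the entire result set in memory
        }
    }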

What will happen if this set is VERY large (10s-100s of GB)?

Is there a more efficient way of doing the iteration while still maintaining loose coupling?

Saul

5 Answers


The correct way to work with large datasets in Entity Framework is:

  • Use EFv4 and POCO objects - this allows sharing objects with the upper layer without introducing a dependency on Entity Framework
  • Turn off proxy creation / lazy loading to fully detach the POCO entities from the object context
  • Expose `IQueryable<EntityType>` to allow the upper layer to specify the query more precisely and limit the number of records loaded from the database
  • When exposing the `IQueryable`, set `MergeOption.NoTracking` on the `ObjectQuery` in your data access method. Combined with disabled proxy creation, this means loaded entities are not cached in the context, and iterating over the query result materializes only a single entity at a time (see the sketch after this list).
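For example, something along these lines with an EF 4 ObjectContext (the context and entity names are made up; adapt them to your model):

    using System;
    using System.Data.Objects;
    using System.Linq;

    // Rough sketch only; MyEntities / Customer are placeholder names.
    public class DataRepository : IDisposable
    {
        private readonly MyEntities _context = new MyEntities();

        public DataRepository()
        {
            // Fully detach the POCO entities from the context.
            _context.ContextOptions.ProxyCreationEnabled = false;
            _context.ContextOptions.LazyLoadingEnabled = false;
        }

        public IQueryable<Customer> GetCustomers()
        {
            // NoTracking: materialized entities are not cached in the ObjectStateManager,
            // so iterating the query streams one entity at a time.
            var query = _context.Customers;
            query.MergeOption = MergeOption.NoTracking;
            return query;
        }

        public void Dispose()
        {
            _context.Dispose();
        }
    }

The upper layer can then narrow the query before it executes, e.g. `repository.GetCustomers().Where(c => c.Country == "DE").Take(100)`, and only the matching records are ever materialized.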

In your simple scenario you can always check that the client doesn't request too many records and simply throw an exception or return only the maximum allowed number of records.

Ladislav Mrnka
    I think "large" you talk about and "large" Saul is talking about are quite different. I dont think his problem can be solved by turning off tracking. In the end all data for this single entity are still loaded into memory. – Euphoric May 08 '11 at 13:03
  • That depends on how the calling method uses the data. Sure, if he runs an unbounded query and calls `ToList` there is no way to avoid loading all the data into memory, but if he only iterates the result and uses a single record at a time he can work with a dataset of a few GBs. – Ladislav Mrnka May 08 '11 at 13:06
  • But that would require more than you are suggesting. EF will still load all the data into memory. Or is there some hidden function inside EF, that I am not aware of, that will internally page data? – Euphoric May 08 '11 at 13:09
  • I don't think that EF loads all data into memory if you turn off tracking and lazy loading. Iterating an `IQueryable` still uses a `DbDataReader` internally, so I believe it loads only a single entity at a time. – Ladislav Mrnka May 08 '11 at 13:11
  • @Ladislav But I don't want to share the entities between classes. I am mapping them to common POCO objects and that's why I can't use direct iteration – Saul May 08 '11 at 13:54
  • @Ladislav Mrnka, I did a simple Northwind sample using your spec here, and used SQL Profiler, and I saw that the query against the database completed right on the first record touched by the `foreach` (I set a breakpoint there). Are you sure the objects are not all in memory? I can't find a doc on that behaviour. – Menahem May 08 '11 at 14:16
  • @Saul: Then your concerns are pretty valid because your architecture will not be able to solve that. – Ladislav Mrnka May 08 '11 at 14:25
  • @Menahem: No, I don't have any link to a doc describing this. It is only my very strong belief that it works this way, because if it doesn't, we can all delete our solutions built on EF and start over with another technology. – Ladislav Mrnka May 08 '11 at 14:47
  • Well, I'll try to do this same test on a table with a large number of rows. I will break on the first iteration, and if the bytes in all heaps go up by a large number I'll know the answer. I'm taking this seriously since we are probably going to build on EF in a big project soon. – Menahem May 08 '11 at 15:25
  • @Menahem: Ladislav is right. If you query for the data with the `NoTracking` flag (or the `AsNoTracking()` extension method in EF 4.1) and you don't have proxies and lazy loading, the ObjectContext remains empty while you are loading the data. You can check this with `DbContext.ChangeTracker.Entries().Count()` (in EF 4.1) (Attention: this call is very slow, only suited for testing). It will always be 0. I have a test project where I iterate through over half a million records in a table in the way described by Ladislav. It takes 1.5 seconds and memory consumption doesn't grow. – Slauma May 08 '11 at 19:40
  • @Ladislav Thanks, that makes sense. Any chance you can supply an example or point to one - especially for implementing the 3rd part? – Saul May 09 '11 at 06:26
  • @Ladislav Mrnka, I have run a test case looping over an 8 million row table and printed out data from each row, and indeed it is as you say. The memory usage was almost constant (except for the strings allocated and collected for the printing) and never exceeded 5 MB. Thanks for an important lesson! – Menahem May 11 '11 at 06:37
  • Turning off tracking did the trick. Now the `IQueryable` behaves like a stream when passed into a CSV helper. Only one object in memory at a time. – Bruno Zell Oct 04 '18 at 16:53

As much as I like EF for quick/simple data access, I probably wouldn't use it for such a scenario. When dealing with data of that size I'd opt for stored procedures that return exactly what you need, and nothing extra. Then use a lightweight DataReader to populate your objects.

The DataReader provides an unbuffered stream of data that allows procedural logic to efficiently process results from a data source sequentially. The DataReader is a good choice when retrieving large amounts of data because the data is not cached in memory.

Additionally, as far as memory management goes, of course make sure you wrap your code handling unmanaged resources in a `using` block for proper disposal/garbage collection.
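A rough sketch of what I mean, assuming a stored procedure named GetDataRange and a _connectionString field (both names are just illustrative):

    // Needs System.Data and System.Data.SqlClient.
    public IEnumerable<LocalDataObject> GetData(int start, int end)
    {
        // _connectionString and the column layout are assumptions; adjust to your schema.
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand("GetDataRange", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@Start", start);
            command.Parameters.AddWithValue("@End", end);

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Only the current row is held in memory at any time.
                    yield return new LocalDataObject
                    {
                        Id = reader.GetInt32(0),
                        Name = reader.GetString(1)
                    };
                }
            }
        }
    }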

You also may want to consider implementing paging.

Kon
  • Whenever you introduce a third-party library (I'd still consider EF that) into the equation, you'll always have some level of overhead which impacts efficiency/performance. With that said, I don't know, I've never looked into that. – Kon May 08 '11 at 13:57

Just a quick note about this point:

The database is queried using LINQ to Entities and the calling class can then iterate over the data. But since other classes have no access to the EF entities, I need to perform a `ToList()` operation on the query, thereby fetching the full dataset into memory.

The problem you encounter here is, in my opinion, not related to EF at all. What would you do if you didn't use EF for data access but raw ADO.NET? "Having no access to EF" then translates to "having no access to a database connection".

If you have services or methods which must work with large amounts of objects but cannot access the database - either via EF or another kind of database connection - you must load the data into memory before you pass it to those services/methods. You could then think about buffering the loaded data on disk on the client side, but that has nothing to do with the source of the data or how you retrieve it. If you don't buffer and instead load everything into memory, you are limited by your available memory. There is no way to escape this limitation, I think.

If you have a connection to EF, I think the solution provided in Ladislav's answer is the right one as the settings and procedure he described reduce EF almost to the behaviour of a simple DataReader.

Slauma

I would use a lazy `IEnumerable` and implement some kind of paging for your data internally.

Maybe create your own `IEnumerable` interface, so the user of your library is not tempted to call `ToList` on it himself.
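Roughly like this, for example (the page size and the context/entity names are made up):

    // Illustrative only; MyEntities / DataEntities are placeholders and PageSize is arbitrary.
    private const int PageSize = 1000;

    public IEnumerable<LocalDataObject> GetData(int start, int end)
    {
        for (int skip = start; skip < end; skip += PageSize)
        {
            List<LocalDataObject> page;
            using (var context = new MyEntities())
            {
                // Each pass issues a separate, bounded query;
                // only one page of results is in memory at a time.
                page = context.DataEntities
                              .OrderBy(e => e.Id)
                              .Skip(skip)
                              .Take(Math.Min(PageSize, end - skip))
                              .Select(e => new LocalDataObject { Id = e.Id, Name = e.Name })
                              .ToList();
            }

            if (page.Count == 0)
                yield break;

            foreach (var item in page)
                yield return item;
        }
    }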

But the question is: is it really a good idea to hide the fact that the user of this data layer will be working with such amounts of data? The first thing to do is to limit the returned data to the bare minimum. Don't return whole entities, but only the parts that you really need. Are you eager-fetching any related entities? Have you thought about using lazy loading?

Euphoric

One way to be certain is to always set an upper threshold on the number of records you will return, ending your query with `.Take(MAX_ROWS)` to avoid a gigantic result set. This can be a workaround or a preventive measure against a bad call that takes down your service, but the best approach is to rethink the solution.
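For example (MAX_ROWS, the context field, and the entity names here are just placeholders):

    // Hypothetical cap; pick a limit that makes sense for your application.
    private const int MAX_ROWS = 10000;

    public List<LocalDataObject> GetData(int start, int end)
    {
        return _context.DataEntities
                       .Where(e => e.Id >= start && e.Id <= end)
                       .Take(MAX_ROWS) // hard upper bound on what a single call can return
                       .Select(e => new LocalDataObject { Id = e.Id, Name = e.Name })
                       .ToList();
    }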

Fadrian Sudaman