6

I'm wondering how one would go about streaming data from SQL server using EF6.

Assume having these classes

  • PersonRepository
  • EFPerson (EF model)
  • DomainPerson (Domain model)
  • PersonUsingClass

Assume that PersonUsingClass depends on getting a bunch of DomainPersons. Assume that business rules dictate that EFPersons aren't allowed to leave the PersonRepository.

Normally I'd have a repository method looking like this:

    public IEnumerable<DomainPerson> GetPeople()
    {
        using (var db = new efContext())
        {
            IQueryable efPeople = db.Person.Where(someCriteria);

            foreach (var person in efPeople)
            {
                yield return person.ToDomainPerson();
            }
        }
    }

With the code I have here, everything would be loaded into memory upon doing the foreach. I could achieve the streaming by returning the IQueryable to the PersonUsingClass, but that exposes the EF models to that class, which is an unwanted scenario.

Is it really impossible to hide away the EF models while streaming data at the same time? Or is there something I don't know?

Treeline
  • 475
  • 1
  • 8
  • 23
  • 3
    No records are being loaded into memory in that method (unless EF is doing some caching internally). If you `foreach` over the result of that method then you are iterating over each record one by one. If you call .`ToList(), ToArray()` on the result of that method then records will be loaded into memory. – Oliver Jul 25 '19 at 14:18
  • 1
    Excellent! A team member had me convinced i was wrong in that we could stream data this way. You can find documentation for EF online that states "LINQ queries are always executed when the query variable is iterated over, not when the query variable is created". So in the end we figured that as soon as the foreach was hit, everything would be loaded into memory, and that yield return then wouldn't have any meaningful effect, other than deferring the mapping "ToDomainPerson". Please create your comment as an answer, then I can accept it. – Treeline Jul 26 '19 at 09:08
  • 1
    Just wanted to mention that when this is enabled builder.EnableRetryOnFailure(5), EF would then in fact load entire dataset in mem on the first iteration, as opposed to doing DateReader style with MoveNext. – Jack0fshad0ws May 05 '22 at 02:12

2 Answers2

5

The method you have created iterates over an IQueryable<> object that's created by EF.

The IQueryable<> variable has deferred execution, therefore internally, EF will only make the call to the database when the IQueryable<> is iterated over (i.e. when .MoveNext() is first called).

Additionally, if you have ever hand-rolled a database call using SqlDataReader, you'll see that it is possible to .Read() results of a query one-by-one, you don't need to load all records into memory. It's likely that the records are being streamed in this way by EF (this is an assumption on my part, it may depend on your specific EF setup).

Your method is returning an IEnumerable<> object which is also subject to deferred exeution. Creating an instance of this by calling GetPeople() will not lead to a database call.

When the result of your method is iterated over, you're then triggering the iteration over the internal IQueryable<> object and transforming the results one by one.

Therefore:

No records are being loaded into memory in that method (unless EF is doing some caching internally). If you iterate over the result of that method then you are iterating over each record one by one.

If you call .ToList() or .ToArray() on the result of that method then records will be loaded into memory.

Oliver
  • 8,794
  • 2
  • 40
  • 60
  • *unless EF is doing some caching internally* -- Well, it does and there are two ways to prevent this (see the other answer). – Gert Arnold Jul 31 '19 at 12:12
  • @GertArnold You're absolutely correct. My interpretation of the question led to my focus on the iteration aspect. I left the EF caching subject vague as there are many variables at play and the subject might even warrant an additional question. For example, tracking can be disabled at context level, you don't have to use `.AsNoTracking()`. – Oliver Jul 31 '19 at 12:45
  • *tracking can be disabled at context level*, only in ef-core though. – Gert Arnold Jul 31 '19 at 12:51
4

Entity Framework queries used to be buffering and could be made streaming by an AsStreaming extension method. But streaming has long been the default and the extension method still exists but is obsolete now (in EF6). That's one.

But don't forget EF's change tracker. By default, EF caches all entities it materializes in its change tracker, which is an identity cache. Therefore, even though the query is streaming, in order to prevent memory consumption you have to prevent EF from tracking entities. And that's exactly what's missing in your code.

Each iteration of the foreach loop attaches one Person instance to the change tracker.

Caching the entities can be prevented in two ways.

  1. Simply use db.Person.AsNoTracking().
  2. Project immediately. Projection creates objects of types that EF doesn't track.

The second method would look like:

var people = db.Person.Where(someCriteria).Select(p => p.ToDomainPerson());

But of course ToDomainPerson() can't be translated into SQL. Instead you should do something like:

db.Person.Where(someCriteria).Select(p => new DomainPerson
{
    Name = p.Name,
    ...
}
);

Or, better, use AutoMapper's ProjectTo method, which keeps your code just as DRY as this ToDomainPerson method.

The advantage of projecting immediately is that you only pull the required fields from the database and that no lazy loading will be triggered afterwards. Lazy loading can be a source of serialization problems or exceptions because the context is disposed when lazy loading is triggered.

Gert Arnold
  • 105,341
  • 31
  • 202
  • 291