0

I have several "data sources", each of which provides ordered timestamped data. I'd like to flatten it into a single ordered stream (like merge sort). This answer describes how to do it for two enumerables, but I am not sure how to generalize it.

Data sources are huge, so I cannot do it in memory, it has to be streamed.

To explain it with an example, I have something like this:

interface IDataSource
{
    IEnumerable<DateTime> GetOrderedRecords();
}

I would like to be able to have an extension method like this:

// get all sources
IEnumerable<IDataSource> dataSources = GetAllSources();

// merge sort
IEnumerable<DateTime> flattened = dataSources
    .MergeSort(s => s.GetOrderedRecords());

[Edit]

The reason I can't load everything eagerly and then sort it is because I am loading data from multiple databases and exporting it into a different one. Each IDataSource is basically Linq-to-NHibernate under the hood, and I have millions of data rows to return.

So what I need is something like:

  1. From all available sources, load the next timestamp.
  2. Store it to disk and "forget it".

Data sources are already sorted, which makes the "merge sort" approach feasible.

Community
  • 1
  • 1
Lou
  • 4,244
  • 3
  • 33
  • 72
  • What do you mean by "I cannot do it in memory, it has to be streamed."? The way your interface is defined, it looks like it returns data that are completely in memory after the method returned. Is this correct? – Daniel Hilgarth Oct 10 '12 at 09:58
  • @Daniel: it returns an `IEnumerable`, instead of a list or a collection. What I meant was, it should be merged by enumerating sources, rather than copying them to multiple arrays and then merging using loops. And `IEnumerable` is lazy-evaluated, so after the method is returned, it's not like "data is completely in memory". – Lou Oct 10 '12 at 10:01
  • I don't understand why `Concat.OrderBy` wasn't a solution for the OP of the linked question, and I don't understand it in your case either. – Daniel Hilgarth Oct 10 '12 at 10:03
  • I know that `IEnumerable`s ***can*** be lazy-evaluated, but if you would fetch data from a database you would more naturally return `IQueryable`. – Daniel Hilgarth Oct 10 '12 at 10:04
  • @Daniel: well, for example, presume that you have 10 databases open, and have their `DataReader`s wrapped in `IEnumerable`s, and that you're mass-exporting merge-sorted data to a different table. With `IEnumerable`, only a single item (`IEnumerable.Current`) is actually present in memory at a given moment. – Lou Oct 10 '12 at 10:04
  • 1
    That's what I was asking. You didn't make that clear in the question. – Daniel Hilgarth Oct 10 '12 at 10:06
  • @Daniel: yeah, you're right, I'll update the question to make it clearer. – Lou Oct 10 '12 at 10:07
  • possible duplicate of [Most efficient algorithm for merging sorted IEnumerable](http://stackoverflow.com/questions/2767007/most-efficient-algorithm-for-merging-sorted-ienumerablet) – cdiggins Jan 21 '13 at 18:42

1 Answers1

2

One simple thing you could do is to concat the calls to the Merge implementation from the question you linked:

public static IEnumerable<DateTime> Merge(this IEnumerable<IDataSource> dataSources)
{
    var result = Enumerable.Empty<DateTime>();

    foreach(var dataSource in dataSources)
    {
        result = result.Merge(dataSource.GetOrderedRecords(), (x, y) => x < y);
    }

    return result;
}

You would call it like this:

var result = dataSources.Merge();

This has the drawback that each call to MoveNext on the enumerator of the returned enumerable yields quite a lot of MoveNext calls on the nested enumerables.

Daniel Hilgarth
  • 171,043
  • 40
  • 335
  • 443
  • Thanks! This looks promising. I am not that concerned about nested calls, the most important thing is that data sources are not evaluated numerous times (and on first sight, it looks like that's true because of lazy evaluation). – Lou Oct 10 '12 at 10:43
  • @Dilbert: Yes, this code should lead to the desired behavior that for each data source only `Current` is in memory. – Daniel Hilgarth Oct 10 '12 at 10:46