9

If I have, for example the following List<int>s

{ 1, 2, 3, 4 } //list1
{ 2, 3, 5, 6 } //list2
...
{ 3, 4, 5 }    //listN

What is the best way to retrieve the following corresponding List<int?>s?

{    1, 2,    3,    4,    null, null } //list1
{ null, 2,    3,    null, 5,    6    } //list2
...
{ null, null, 3,    4,    5,    null } //listN
dav_i
  • 27,509
  • 17
  • 104
  • 136
  • Iterate from 1 to 6 and check for each item if it's in the input list. Since the input list is sorted, you can do this in *O(n)*. – dtb Nov 22 '12 at 13:34
  • 1
    What is the use of such a list where `index-1 == value`? – L.B Nov 22 '12 at 13:35
  • @L.B.: I'm not sure what the use of this but in all fairness sometimes the value is null, not index. – Chris Nov 22 '12 at 13:36
  • The use of low `int` values was just for easy readability by way of example and should not be considered as the actual data to be used! – dav_i Nov 22 '12 at 13:38
  • @dav_i: perhaps you could explain your use case? At the moment you look like rather than having your new lists you are better off having a function that just takes a list and an index and generates the result (by just saying `originalList.Contains(index+1)` or similar. This will of course work for generating the full list too through a loop but seems much simpler not to generate the full lists... – Chris Nov 22 '12 at 13:41
  • @Chris just for the record, my solution does just that: `AlignSequences` is a _lazy_ enumerator. It doesn't store the result in a 'large list'. It just _yields_ 1 record at a time. – sehe Nov 22 '12 at 14:06
  • @dav_i For simple cases, you can employ the [`FullOuterJoin`](http://stackoverflow.com/a/13503860/85371) implementation I posted at an older question. For an idea, see http://ideone.com/mFLYhT - extrapolating for more lists, is left as an exercise to the reader – sehe Nov 22 '12 at 14:16
  • @dav_i: Added [another approach](http://stackoverflow.com/a/13514935/284240). – Tim Schmelter Nov 22 '12 at 14:53
  • @sehe: my point was more to go for a "random access" type thing rather than a sequential access. For example to find out what is in position 5 of the new list 1 you don't need to know what is in any previous positions but with any enumerator based thing it will need to look at all the previous values before getting to the one you want won't it? – Chris Nov 23 '12 at 10:28
  • @Chris `aligned[0].ElementAt(5)`? Has the **enormous** benefit that it won't have to do more than 5 `BinarySearch` calls, as it won't even evaluate any of the other result lists, and none of the positions in _new list1_ (`aligned[0]`) beyond index 5. See **[http://ideone.com/AghZ0k](http://ideone.com/AghZ0k)** – sehe Nov 23 '12 at 11:26
  • @sehe I get that it uses less than some of the alternatives. My points is though that if you the original list just contains `10000` and you ask for the item at `newlist[10000]` then you can work this out with a function that just says List.Contains(10000) and you're done. Your way involves working out the first 10000 elements of newlist which might not be that slow but is definitely not as good. Where the data in a list is a function of its position it just doesn't seem to make sense to use a list rather than just use the function. – Chris Nov 23 '12 at 11:46
  • @Chris granted BUT: That was not your question. Had it been, the answer would have been "_your `List<>` should have been `HashSet<>`"_ (as I've applied it partially in the 'hybrid' solutions (**[1](http://ideone.com/wcI872)**, and **[2](http://ideone.com/00DMux)** **[others](http://ideone.com/AghZ0k)**). It is all the usual trade-off of time and space. My own answer trades _near-zero memory overhead_ for linear time complexity (**but** it is lazy, so you only pay for what you use). I think my answer is still the best _generic_ approach to answer your original question _as posed_. – sehe Nov 23 '12 at 12:46
  • P.S. Re. "then you can work this out with a function that just says List.Contains(10000)" - Of course using `List.Contains` is horrific there. Make that _at least_ List.BinarySearch if you can't change to a set implementation altogether – sehe Nov 23 '12 at 12:49

2 Answers2

7

I'm posting the solution we discussed in chat. I had an unoptimized version using Linq for all things loopy/filtering:

However, I suspect it won't be too performant because of all the enumerator classes created, and the collections being instantiated/modified along the way.

So I took the time to optimize it into handwritten loops with an administration to keep track of active iterators instead of modifying the iters collection. Here it is:

See http://ideone.com/FuZIDy for full live demo.

Note I assume the lists are pre-ordered by DefaultComparer<T>, since I use Linq'sMin() extension method without a custom comparer

public static IEnumerable<IEnumerable<T>> AlignSequences<T>(this IEnumerable<IEnumerable<T>> sequences)
{
    var iters = sequences
        .Select((s, index) => new { active=true, index, enumerator = s.GetEnumerator() })
        .ToArray();

    var isActive = iters.Select(it => it.enumerator.MoveNext()).ToArray();
    var numactive = isActive.Count(flag => flag);

    try
    {
        while (numactive > 0)
        {
            T min = iters
                .Where(it => isActive[it.index])
                .Min(it => it.enumerator.Current);

            var row = new T[iters.Count()];

            for (int j = 0; j < isActive.Length; j++)
            {
                if (!isActive[j] || !Equals(iters[j].enumerator.Current, min)) 
                    continue;

                row[j] = min;
                if (!iters[j].enumerator.MoveNext())
                {
                    isActive[j] = false;
                    numactive -= 1;
                }
            }
            yield return row;
        }
    }
    finally
    {
        foreach (var iter in iters) iter.enumerator.Dispose();
    }
}

Use it like this:

public static void Main(string[] args)
{
    var list1 = new int?[] { 1, 2, 3, 4, 5 };
    var list2 = new int?[] { 3, 4, 5, 6, 7 };
    var list3 = new int?[] { 6, 9, 9 };

    var lockstep = AlignSequences(new[] { list1, list2, list3 });

    foreach (var step in lockstep)
        Console.WriteLine(string.Join("\t", step.Select(i => i.HasValue ? i.Value.ToString() : "null").ToArray()));
}

It prints (for demo purposes I print the results sideways):

1       null    null
2       null    null
3       3       null
4       4       null
5       5       null
null    6       6
null    7       null
null    null    9
null    null    9

Note: You might like to change the interface to accept arbitrary number of lists, instead of a single sequence of sequences:

public static IEnumerable<IEnumerable<T>> AlignSequences<T>(params IEnumerable<T>[] sequences)

That way you could just call

var lockstep = AlignSequences(list1, list2, list3);
sehe
  • 374,641
  • 47
  • 450
  • 633
1

Here's another approach using List.BinarySearch.

sample data:

var list1 = new List<int>() { 1, 2, 3, 4 };
var list2 = new List<int>() { 2, 3, 5, 6, 7, 8 }; 
var list3 = new List<int>() { 3, 4, 5 };
var all   = new List<List<int>>() { list1, list2, list3 };

calculate min/max and all nullable-lists:

int min = all.Min(l => l.Min());
int max = all.Max(l => l.Max());
// start from smallest number and end with highest, fill all between
int count = max - min + 1;  

List<int?> l1Result = new List<int?>(count);
List<int?> l2Result = new List<int?>(count);
List<int?> l3Result = new List<int?>(count);

foreach (int val in Enumerable.Range(min, count))
{
    if (list1.BinarySearch(val) >= 0)
        l1Result.Add(val);
    else
        l1Result.Add(new Nullable<int>());

    if (list2.BinarySearch(val) >= 0)
        l2Result.Add(val);
    else
        l2Result.Add(new Nullable<int>());

    if (list3.BinarySearch(val) >= 0)
        l3Result.Add(val);
    else
        l3Result.Add(new Nullable<int>());
}

output:

Console.WriteLine(string.Join(",", l1Result.Select(i => !i.HasValue ? "NULL" : i.Value.ToString())));
Console.WriteLine(string.Join(",", l2Result.Select(i => !i.HasValue ? "NULL" : i.Value.ToString())));
Console.WriteLine(string.Join(",", l3Result.Select(i => !i.HasValue ? "NULL" : i.Value.ToString())));

1,      2,      3,      4,      NULL,   NULL,   NULL,   NULL
NULL,   2,      3,      NULL,   5,      6,      7,      8
NULL,   NULL,   3,      4,      5,      NULL,   NULL,   NULL

DEMO

Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • +1 - note that this indeed incurs the cost of storing the aligned lists in full. Just a note in case it matters (in case of large lists) – sehe Nov 22 '12 at 14:51
  • okay, I just read it a little more closely: this depends on both min and max, ignores the fact that the lists be pre-sorted. The `all` list serves no actual purpose (`Math.Max()` would be a lot faster). It might traverse a lot more indices then are present, resulting in highly sparse output (imagine `6` would not have in in any list, or list2 contained `91467`?). It is hardcoded for 3 input lists, and the input type of `int?`. Not to mention you'll be doing `3*(max-min+1)` binary searches which are unnecesarry. – sehe Nov 22 '12 at 16:22
  • @sehe: Ok, the `Min/Max` part is unnecessary. I's more efficient to look only at the first index of each list for min and at the last for max. I could use `Enumerable.Concat` instead to get `all`. But we're talking of milliseconds, i wanted to keep the code readable. Mabe i have understood the question differently. My interpretation: take the smallest number and the highest number of all lists, look into each list for every number in this range and see it exists in that list, if not add `int?` otherwise the actual value. – Tim Schmelter Nov 22 '12 at 16:36
  • @sehe: Note also that OP's desired result are three `List`(in this example). So consider an example where the first list is `Enumerable.Empty().ToList()` and the second contains 1M items, the correct result would be two `List` where each list has 1M elements. – Tim Schmelter Nov 22 '12 at 16:42
  • Fair point, about the output presentation. Of course, that'd be trivially extracted from my lazy version, as opposed to vice versa. Anyways, I feel you didn't get the points I was trying to make: **[http://ideone.com/00DMux](http://ideone.com/00DMux)** – sehe Nov 22 '12 at 18:14
  • However, if you wanted to do all that (a) with the desired output format (b) lazily (c) without listing all _ununused_ intermediate values and (d) without unnecessary binary searches (e) without hardcoding the number of lists: **[http://ideone.com/wcI872](http://ideone.com/wcI872)** (PS. note that it is a generic function too, and in 15 fewer lines of code). – sehe Nov 22 '12 at 18:18