3

I've been wondering; in the company I work for, we manage lots of data, but as it's effectively given to us by customers, we don't necessarily trust it - with good reason. A lot of it has the wrong timestamp, or some of it is missing, or whatever else have you.

One of the tasks that I have had to do a lot recently is basically find elements that are null within a set of elements, then find the next non-null element, then average out the difference between those null records. That is, say we have dataset A:

A = { 0f, 1f, 2f, 5f, Null, Null, Null, 7f, Null, 8f }

It's important to note that we have to distinguish between 0 and Null. The difference is obviously that 0 is 0, while Null is no data at all.

Using LINQ, is there a way that we can basically access the following subsection of A:

Subsection { Null, Null, Null, 7f }

And have it in a collection such that we can transform it into (7/4f) over the four records..

Subsection { 1.75f, 1.75f, 1.75f, 1.75f }

Such that when iterating over A again, we get the following output:

{ 0f, 1f, 2f, 5f, 1.75f, 1.75f, 1.75f, 1.75f, 4f, 4f }

Currently the way I do this is do a pass using a numeric for, looking for a null element, then storing all consecutive nulls in a List<T>, and after finding the next non-null, assigning all variables by iterating over said List<T>. It does the job but it looks pretty nasty.

So, for the sake of narcicissim, is there a way of doing this neatly (= less code clutter)?

Pseudo

a = { 0, 1, 2, 5, null, null, null, 7, null, 0 }


nullList = new List()
for i = 0, a.length
    if i == null
        nullList.add(i)
    else
        if nullList.length > 0
            nullList.add(i)
            int avg = nullList.Aggregate(x => x)
            foreach element in nullList
                element = avg
            nullList.clear()
Servy
  • 202,030
  • 26
  • 332
  • 449
Dan
  • 10,282
  • 2
  • 37
  • 64
  • 1
    Can you post some of your actual code? – JMK Oct 15 '13 at 14:21
  • Would the same rule apply to `Null, 8f` i.e. `4f, 4f`? – James Oct 15 '13 at 14:23
  • @James - yes, I corrected it – Dan Oct 15 '13 at 14:25
  • 1
    @JMK: posted pseudo, unless you want my current code? I'm not trying to fob off my work to you guys by the way, this is more of a curiosity than anything else. as I mentioned I already have a solution, which follows the pseudo-code posted – Dan Oct 15 '13 at 14:26
  • 2
    what should happen if there is a trailing `null` or sequence of nulls? – Servy Oct 15 '13 at 14:30
  • 4
    Whatever you do in LINQ is going to be much uglier than a `for` loop, because LINQ is better at working with sets than with ordered sequences. – Sergey Kalinichenko Oct 15 '13 at 14:31
  • i curl into a ball and cry. haven't come across that situation yet so i havent'had to think about it. – Dan Oct 15 '13 at 14:31
  • @dasblinkenlight not something i was aware of. just been really taken by linq recently due to how helpful it has been in my work (because we use sets mainly) – Dan Oct 15 '13 at 14:31
  • 3
    @DanPantry Honestly, your iterative solution isn't really that bad. and is probably going to be nicer than LINQ based solutions. – Servy Oct 15 '13 at 14:32
  • Not being linq-guru enough to say for sure, it may not be the tool of choice. Much like SQL, being a set-based language tends to imply that the order of order the set is meaningless so there's little built in to handle what it sees as useless. – John Spiegel Oct 15 '13 at 14:33
  • Maybe you should just post your code on codereview.stackexchange.com – sloth Oct 15 '13 at 14:35
  • 3
    @JohnSpiegel LINQ is actually sequence based, not so much set based. Most LINQ operators actually preserve order unless there's a compelling reason for them not to. That said, it's still not a particularly well suited problem for LINQ, given the operators available. – Servy Oct 15 '13 at 14:39
  • Also gotta mention that `.Aggregate` doesn't like to take in nulls, and it's pretty messy to write a statement that has to account for the fact that both "this" and "next" records might be null – Dan Oct 15 '13 at 14:41
  • @DominicKexel I didn't even know that SO existed. – Dan Oct 15 '13 at 14:41
  • You might check out Interactive Extensions (Ix - derived from Rx), which offers more LINQ methods. I'm not sure if there's anything that would address your specific question. See http://www.infoq.com/news/2011/07/Ix – TrueWill Oct 15 '13 at 14:48

6 Answers6

2

If I understand your question correctly, you want to replace the null values in the list with a value based on the first non-null value. I don't see why you'd need a second list of nulls for this. Here's an attempt to just modify the list in-place, although it's not much shorter than what you already have:

var A = new List<float?> { 0f, 1f, 2f, 5f, null, null, null, 7f, null, 8f };

for (int i = A.IndexOf(null); i != -1; i = A.IndexOf(null, i))
{
    int j = 0;
    do { j++; } while (A[i + j] == null);
    float f = A[i + j].Value / (j + 1);
    do { A[i++] = f; } while (j --> 0);
}

// A == { 0f, 1f, 2f, 5f, 1.75f, 1.75f, 1.75f, 1.75f, 4f, 4f }

The code repeatedly searches the list for a nulls (continuing where it left off when it found a null previously), counts the number of nulls next to each other, and then distributes the first non-null value across the gap. The code assumes that there is always a non-null value after each gap.

As pointed out in numerous comments, the use of LINQ does not provide any real advantage here.

dtb
  • 213,145
  • 36
  • 401
  • 431
  • I think you misunderstood - I don't need a second list (nor do I create/use one), I was just wondering if there was a more elegant way than the messy state management (for want of a better word) my current approach takes. I do currently modify the elements in place (gotta love reference types) – Dan Oct 15 '13 at 14:32
  • I think you need to divide by `j+1` since j is 0 based – Harrison Oct 15 '13 at 14:33
  • 1
    @DanPantry: In your pseudo-code I see two lists: `a` and `nullList`. My answer does not have a `nullList`, but apart from that basically does the same as your pseudo-code. I don't know if that is less messy to your eyes or not :-) – dtb Oct 15 '13 at 14:39
  • `while (j --> 0);` [while j goes to 0?](http://stackoverflow.com/questions/1642028/what-is-the-name-of-this-operator) – Ahmed KRAIEM Oct 15 '13 at 14:41
  • @AhmedKRAIEM: while j goes to 0, exactly! Thanks for adding the link; I was already looking for it in case someone asks about the --> operator :-) – dtb Oct 15 '13 at 14:43
  • Ah, now I get you. I was using nullList to maintain the state of nulls, as it were, it was the more intuitive solution to me at the time (9am without coffee) – Dan Oct 15 '13 at 14:44
  • --> is funky. never seen that before. – Dan Oct 15 '13 at 14:45
2

So first off we'll use a helper method called GroupWhile. It will take in a sequence and a function; that function will be given the previous item and the current item and based on that will determine if the current item should be part of a new group, or part of the previous group. It lets us group items while some condition is met:

public static IEnumerable<IEnumerable<T>> GroupWhile<T>(
    this IEnumerable<T> source, Func<T, T, bool> predicate)
{
    using (var iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
            yield break;

        List<T> list = new List<T>() { iterator.Current };

        T previous = iterator.Current;

        while (iterator.MoveNext())
        {
            if (predicate(previous, iterator.Current))
            {
                list.Add(iterator.Current);
            }
            else
            {
                yield return list;
                list = new List<T>() { iterator.Current };
            }

            previous = iterator.Current;
        }
        yield return list;
    }
}

Using this we can group items while the previous item is null. We then take each group, repeat the average value of that group group.Count() times, and then flatten the sequence out again:

public static IEnumerable<float> ConsolodateNulls<T>(IEnumerable<float?> source)
    where T : struct
{
    return source.GroupWhile((prev, curr) => prev == null)
        .SelectMany(group => Enumerable.Repeat(
            group.LastOrDefault(item => item != null) ?? 0 / group.Count(),
            group.Count()));
}
Servy
  • 202,030
  • 26
  • 332
  • 449
  • This is the sort of thing I was looking for, I'm not 100% on iterators myself which is probably why I couldn't write it myself – Dan Oct 15 '13 at 14:40
  • @DanPantry Honestly, I'd go with the solution you had to start with. I had an implementation of `GroupWhile` already sitting around to re-use, which makes this not *so* bad. If you didn't, it'd be *way* more work trying to write this. – Servy Oct 15 '13 at 14:41
  • The solution I wrote involves a lot of copy pasting, that's the thing. At least with LINQ it looks a lot nicer in an extension method and can be shoved onto `IEnumerable` and could be used in other things outside of this scope. I appreciate the for loop **could** be used in an Extension method but it feels a lot more niché if that makes sense. This sort of thing happens all around the legacy code and I'm the main code monkey in our IT team :( – Dan Oct 15 '13 at 14:43
  • If you are going to generalize it into a `GroupWhile()`, I would maintain consistency with LINQ and use `IEqualityComparer`. – Cory Nelson Oct 15 '13 at 14:43
  • @CoryNelson What would I need a comparer for? It takes in a predicate to handle the logic, and that predicate can do whatever it wants. In this usage it's not even comparing the two items. – Servy Oct 15 '13 at 14:45
  • @Servy Apologies, early morning misinterpretation. You are correct. – Cory Nelson Oct 15 '13 at 15:37
1

You can create an extension method that does this. This extension method can then be used in a normal LINQ statement:

public static IEnumerable<float> SmoothGaps(this IEnumerable<float?> source)
{
    int numberOfNulls = 0;
    foreach(var item in source)
    {
        if(item == null)
        {
            ++numberOfNulls;
        }
        else
        {
            if(numberOfNulls != 0)
            {
                for(int i=0; i <= numberOfNulls; ++i)
                    yield return item.Value / (numberOfNulls + 1);
            }
            else
                yield return item.Value;
            numberOfNulls = 0;
        }
    }
}

Usage would be simply:

var result = a.SmoothGaps();

nulls at the end of source will simply be removed.

Daniel Hilgarth
  • 171,043
  • 40
  • 335
  • 443
  • 1) Why manually iterate the sequence rather than using a `foreach`? 2) You're not disposing of the enumerator 3) this probably shouldn't be an extension method; it's not likely to get enough use; it'd just be polluting intellisense. A regular static method is likely fine. – Servy Oct 15 '13 at 14:44
  • @Servy Good point with the enumerator. That was a left-over of an earlier attempt. Removed. About the extension method: That totally depends on the application of the OP. And it only shows up when the namespace is included. – Daniel Hilgarth Oct 15 '13 at 14:48
  • 1
    @Servy disagree on "intellisense pollution". Namespaces exist to avoid this. And extension method allows for chaining: x.SmoothGaps().Where(..) etc. – Pavel Tupitsyn Oct 15 '13 at 15:00
  • @Kefir You can chain the methods just fine without them being extensions. It's not quite as pretty, but it works just fine. The point is that you generally want to use extension methods for things that you're using reasonably often. Yes, requiring the imported namespace helps, but even dealing with it throughout the rest of the file using that namespace can be annoying, if you know you won't use it. The point is most methods won't need to be extensions, and you should make them one if given a reason to do so, rather than the other way around. If this would be used enough, that's fine. – Servy Oct 15 '13 at 15:05
  • 1
    @Servy this is a matter of a personal preference, of course. As for me, readability is the most important thing. Code is written once, but can be read a lot. And extension method syntax (chaining) improves readability. – Pavel Tupitsyn Oct 15 '13 at 15:29
  • @downvoter: Please leave a comment, so I can fix whatever you think is wrong. – Daniel Hilgarth Oct 15 '13 at 18:41
1

Here is how you can do it purely within LINQ:

var data = new List<float?> { 0f, 1f, 2f, 5f, null, null, null, 7f, null, 8f };
var corrected = data
    .Select((v,i) => new {
        Index = i
        // Find the index of the next non-null item in the list
    ,   NextNonNull = i + data
            .Skip(i)
            .Select((vv,j) => new {j,vv})
            .First(p => p.vv.HasValue).j
    ,   Value = v
    })
    .GroupBy(p => p.NextNonNull)
    // For each group, insert its average g.Count() times
    .SelectMany(g => g.Select(e => data[g.Key]/g.Count()))
    .ToList();
for (var i = 0 ; i != data.Count ; i++ ) {
    Console.WriteLine("{0} - {1}", data[i], corrected[i]);
}

Disclaimer: this solution is provided for amusement purposes only. It is going to be slower than your solution that is based on a for loop, potentially adding an extra order to the complexity (i.e. making it O(n^2) instead of O(n)).

Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
1

A pure LINQ version using Aggregate for your amusement :

float?[] A = { 0f, 1f, 2f, 5f, null, null, null, 7f, null, 8f };
var result = A.Aggregate(Tuple.Create(new List<float>(), 0), 
 (items, current) => 
 {
    if(current.HasValue)
    {
        if(items.Item2 == 0)
            items.Item1.Add(current.Value);
        else
        {
            var avg = current.Value / (items.Item2 + 1);
            for(int i = 0; i <= items.Item2; i++)
                items.Item1.Add(avg);
        }
        return Tuple.Create(items.Item1, 0);
    }
    else
        return Tuple.Create(items.Item1, items.Item2 + 1);
 }).Item1;

I wouldn't use this in production code because the head of an average developer will explode on Aggregate, using Tuple in C# always look kinda ugly and an imperative solution works well and is more understandable than this.

sloth
  • 99,095
  • 21
  • 171
  • 219
0

I don't think it can be done with pure linq (existing linq methods), but I'd write an iterator to do this:

public IEnumerable<float?> ProcessSequence(IEnumerable<float?> seq)
{
int nullCount = 0;
foreach(var x in seq)
{
    if (x == null)
    {
        nullCount++;
    }
    else if (nullCount > 0)
    {
        nullCount++;
        var mid = x / nullCount;
        for (var i = 0; i<nullCount; i++)
        {
            yield return mid;
        }
        nullCount = 0;
    }       
    else
    {
        yield return x;
    }
}
}
Pavel Tupitsyn
  • 8,393
  • 3
  • 22
  • 44
  • He's dealing with a sequence of floats, not ints, but that's not a major change. – Servy Oct 15 '13 at 14:42
  • @Servy, right, thanks. Can even make it completely generic with Func parameter to calculate average. – Pavel Tupitsyn Oct 15 '13 at 14:59
  • No, it'd need to be a `Func, T>` (or possibly a `Func`, instead, for a non-sequence solution) to calculate the average, if you wanted to support any type of `T`. – Servy Oct 15 '13 at 15:07