10

Short version:

The function passed as the fourth argument to accumarray sometimes gets called with arguments that are not consistent with specifications encoded the first argument to accumarray.

As a result, functions used as arguments to accumarray must test for what are, in effect, anomalous conditions.

The question is: how can an a 1-expression anonymous function test for such anomalous conditions? And more generally: how can write anonymous functions that are robust to accumarray's undocumented behavior?


Full version:

The code below is a drastically distilled version of a problem that ate up most of my workday today.

First some definitions:

idxs = [1:3 1:3 1:3]';

vals0 = [1   4 6   3 5 7   6 Inf 2]';
vals1 = [1 Inf 6   3 5 7   6   4 2]';

anon = @(x) max(x(~isinf(x)));

Note vals1 is obtained from vals0 by swapping elements 2 and 8. The "anonymous" function anon computes the maximum among the non-infinite elements of its input.

Given these definitions, the two calls below

accumarray(idxs, vals0, [], anon)
accumarray(idxs, vals1, [], anon)

which differ only in their second argument (vals0 vs vals1), should produce identical results, since the difference between vals0 and vals1 affects only the ordering of the values in the argument to one of the calls to anon, and the result of this function is insensitive to the ordering of elements in its argument.

As it turns out the first of these two expressions evaluates normally and produces the right result1:

>> accumarray(idxs, vals0, [], anon)
ans =
     6
     5
     7

The second one, however, fails with:

>> accumarray(idxs, vals1, [], anon)
Error using accumarray
The function '@(x)max(x(~isinf(x)))' returned a non-scalar value.

To troubleshoot this problem, all I could come up with2 was to write a separate function (in its own file, of course, "the MATLAB way")

function out = kluge(x)
    global ncalls;
    ncalls = ncalls + 1;
    y = ~isinf(x);
    if any(y)
        out = max(x(y));
    else
        {ncalls x}
        out = NaN;
    end
end

...and ran the following:

>> global ncalls;
>> ncalls = int8(0); accumarray(idxs, vals0, [], @kluge)
ans =
     6
     5
     7
>> ncalls = int8(0); accumarray(idxs, vals1, [], @kluge)
ans = 
    [2]    [Inf]

ans =
     6
     5
     7

As one can see from the output of the last call to accumarray above, the argument to the second call to the kluge callback was the array [Int]. This tells me beyond any doubt that accumarray is not behaving as documented3 (since idxs specifies no arrays of length 1 to be passed to accumarray's function argument).

In fact, from this and other tests I determined that, contrary to what I expected, the function passed to accumarray is called more than max(idxs) (= 3) times; in the expressions involving kluge above it's called 5 times.

The problem here is that if one cannot rely on how accumarray's function argument will actually be called, then the only way to make this function argument robust is to include in it a lot of extra code to perform the necessary checks. This almost certainly will require that the function have multiple statements, which rules out anonymous functions. (E.g. the function kluge above is robust more robust than anon, but I don't know how to fit into an anonymous function.) Not being able to use anonymous functions with accumarray greatly reduces its utility.

So my question is:

how to specify anonymous functions that can be robust arguments to accumarray?


1 I have removed blank lines from MATLAB's typical over-padding in all the MATLAB output shown in this post.
2 I welcome comments with any other troubleshooting suggestions you may have; troubleshooting this problem was a lot harder than it should be.
3 In particular, see items number 1 through 5 right after the line "The function processes the input as follows:".

chappjc
  • 30,359
  • 6
  • 75
  • 132
kjo
  • 33,683
  • 52
  • 148
  • 265
  • You say "This tells me beyond any doubt that `accumarra` is not behaving as documented (since `idxs` specifies no arrays of length 1 to be passed to `accumarray`'s function argument)." However, the documentation does not specify that the function is called _only_ for the whole array corresponding to each value of `idx`, does it? – Luis Mendo Feb 10 '14 at 23:14
  • 1
    Well, the documentation also doesn't say that the function won't erase my hard disk either... If the user cannot have any assurance about how `accumarray` will use the function argument passed to it, then the only functions that can be passed to `accumarray` reliably are ones that have to be grossly over-defensive. In good software design, a function that takes other functions as arguments specifies exactly how those argument functions will be called. Without this guarantee, it is not possible to write reliable code. – kjo Feb 10 '14 at 23:19
  • 1
    The documentation _does_ state that `fun` must return a scalar, so `accumarray` is entitled to go wrong (or not) if it doesn't. It's not just C that has undefined behaviour. – Notlikethat Feb 10 '14 at 23:32
  • @Notlikethat Exactly. It says "`fun` is a function that accepts a column vector and returns a numeric, logical, or char scalar, or a scalar cell." It doesn't say "`fun` is a function that accepts a column vector formed by all same-index values and returns..." – Luis Mendo Feb 10 '14 at 23:34
  • @Notlikethat: My point is that if one is going to specify a function as an anonymous function, one does not have the luxury to test all the possible conditions beyond the ones that one actually wants to compute. If `accumarray` is going to be so liberal in its use of the argument that is passed to it, then it cannot be specified as an anonymous function. That's the whole point of this post. So the choice is between: "the documentation is incorrect" or "the implementation ineptly renders the function inconvenient". – kjo Feb 10 '14 at 23:42
  • 1
    As a side issue: you can remove those blank lines with `format compact` – Luis Mendo Feb 10 '14 at 23:58
  • @LuisMendo: thanks for the tip! I work half the time on a laptop, where vertical space is in short supply. – kjo Feb 11 '14 at 00:01
  • 1
    It's not specified _how_ `fun` is called, thus you can't make assumptions. If `anon` returned scalar for _any possible combination of the given input_, that would be OK. However, it doesn't, thus you can't guarantee success. Anonymous functions are pretty limited and it's often simply not possible to get write one for a given task - I pretty much gave up trying to use them because of this. An extra 3 or 4 lines for a robust implementation via a local or nested function isn't _that_ bad. – Notlikethat Feb 11 '14 at 00:35
  • @Notlikethat: OK, to each his own. If the function behaved *exactly as described in the documentation* (see in particular items numbered 1 through 5 after the sentence *The function processes the input as follows:...*) then the function would be robust enough to be passed relatively simple anonymous functions. See also my latest conjecture (in my last few comments to chappjc's answer) about what the implementation is actually doing. This conjecture is consistent with various tests I've run. If it's correct, then I'd say that the function does not match the behavior described in points 1-5. – kjo Feb 11 '14 at 00:44
  • @kjo: Even if it followed those points exactly on the first go, it would not necessarily be robust if a complete subarray of values contained just `Inf`s (consider that `anon([Inf Inf Inf])` returns an empty matrix too). Respectfully, it's just that `anon` needs to return a scalar **no matter what**. – chappjc Feb 11 '14 at 01:22
  • @chappjc: it would be robust inasmuch as the user would know *exactly* how its fourth argument would be called, so they can code it *minimally*, or at least know whether an anonymous function is suitable for the task. With the data I happen to be using in real life, the calls to `anon` are supposed to have arguments that are of the order of 100-long, and a frequency of `Inf` values that is under 0.1%. Furthermore, earlier steps have ruled out catastrophic errors in the data (such as an input of all `Inf`). With a properly implemented `accumarray`, `anon` as defined is perfectly robust ... – kjo Feb 11 '14 at 13:48
  • ...for the originally intended application. Functions that take callbacks need to say *exactly* how they're going to use those callbacks. Without this, callbacks would need to be designed to be as robust as an "all-comers" library function, and that is unnecessary inconvenience and a waste of programming time. – kjo Feb 11 '14 at 13:49

2 Answers2

7

Short answer

The fourth input argument of accumarray, anon in this case, must return a scalar for any input.

Long answer (and discussion about index sorting)

Consider the output when the indexes are sorted:

>> [idxsSorted,sortInds] = sort(idxs)
>> accumarray(idxsSorted, vals0(sortInds), [], anon)
ans =
     6
     5
     7
>> accumarray(idxsSorted, vals1(sortInds), [], anon)
ans =
     6
     5
     7

Now, all the documentation has to say about this is the following:

If the subscripts in subs are not sorted, fun should not depend on the order of the values in its input data.

How does this relate the trouble with anon? It is a clue, as this forces anon to be called for the complete set of values for a given idx rather than a subset/subarray, as Luis Mendo suggested.


Consider how accumarray would work for a non-sorted list of indexes and values:

>> [idxs vals0 vals1]
ans =
     1     1     1
     2     4   Inf
     3     6     6
     1     3     3
     2     5     5
     3     7     7
     1     6     6
     2   Inf     4
     3     2     2

For both vals0 and vals1, the Inf belongs to the set where idxs equals 2. Since idxs is not sorted, it does not process all values for idxs=2 in one shot, at first. The actual algorithm (implementation) is opaque, but it seems to start by assuming that idxs is sorted, processing each single-valued block of the first argument. This is verifiable by putting a breakpoint in fun, the function reference by fourth input argument. When it encounters a 1 in idxs for the second time, it seems to start over, but with subsequent calls to fun containing all the values for a given index. Presumably accumarray calls some implementation of unique to fully-segment idxs (incidentally, order is not preserved). As kjo suggests, this is the point where accumarray actually processes the inputs as described in the documentation, following steps 1-5 here ("Find out how many unique indices there are..."). As a result, it crashes for vals1, when anon(Inf) is called, but not for vals0, which instead calls anon(4) on the first try.

However, even if it followed those steps exactly on the first go, it would not necessarily be robust if a complete subarray of values contained just Infs (consider that anon([Inf Inf Inf]) returns an empty matrix too). It is a requirement, although an understated one, that fun must return a scalar. What is not clear from the documentation is that it must return a scalar, for any inputs, not just what is expected based on the high-level description of the algorithm.


Workaround:

anon = @(x) max([x(~isinf(x));-Inf]);
chappjc
  • 30,359
  • 6
  • 75
  • 132
  • I don't know what to say. Most of the examples in the documentation for `accumarray` feature first arguments whose numbers of dimensions are greater than 1, and for which the sorted order should at least be specified. For the single example that uses a one-dimensional first argument (namely Example 1), the first argument is not sorted. It seems strange that this first example of the documentation would disregard the requirement of using a sorted first argument. – kjo Feb 10 '14 at 23:28
  • Well, it's not a requirement in general, just if you require that "`fun` not depend on the order of the values in its input data". In your case it is a rather indirect dependency! That is certainly borderline covered by the documents to say the least. – chappjc Feb 10 '14 at 23:31
  • 1
    Now I think that the function's implementation does something like this: it starts out *assuming* that the first argument is sorted, and applies the functions to subarrays of the second argument corresponding to maximal single-valued blocks in the first argument. But once it gets information that contradicts the initial assumption of a sorted first argument, it switches to the behavior described in points 1-5 of the documentation (right after *"The function processes the input as follows:..."*). (If points 1-5 were followed *exactly* as described, the ordering of the first argument... – kjo Feb 10 '14 at 23:58
  • ...would not matter.) – kjo Feb 10 '14 at 23:59
  • @kjo Excellent conjecture. I'll certainly update my answer to that effect! – chappjc Feb 11 '14 at 00:02
  • Consistent with this conjecture, when I call `accumarray` with sorted arguments, the fourth (callable) argument (e.g. `kluge`) gets called the minimum number of times (namely, the number of distinct values in the first argument). – kjo Feb 11 '14 at 00:22
  • This is really interesting. Just for the hell of it I also tried the cell-array workaround: `cell2mat(accumarray(..., @(x) {max(...)}))`, which interestingly enough is only a few percent slower. – knedlsepp Jan 15 '15 at 00:16
6

The documentation does not say that anon is called only with the whole set1 of vals corresponding to each value of idx as its input. As seen in your example, it does get called with subsets thereof.

So the way to make anon robust seems to be: make sure it gives a scalar output when its input is any subset of vals (or maybe just any subset of each set with same-idx value). In your case, anon(inf) does not return a scalar.

1 It's actually an array, of course, but I think it's easier to describe this in terms of sets (and subsets).

Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • 2
    This is exactly the problem. It just so happens that sorting the indexes (and their corresponding values to go with them) prevents this `anon(Inf)` case from happening. Interesting. – chappjc Feb 10 '14 at 23:24
  • Well, the `Inf`s get mixed in with the other non-`Inf` values when `idxs` is sorted because the entire set of values for that `idxs` subscript is processed, so the issue is hidden. Why is that `Inf` by itself in `vals0` and not `vals1`? Hard to tell. – chappjc Feb 10 '14 at 23:29
  • @chappjc: Yes, there are many ways to make the error go away. I did not post most of them, because they are all equally puzzling. – kjo Feb 10 '14 at 23:30