
While trying to paste images, I noticed that Cases[] is very slow.

To reproduce, first copy a large image to the clipboard (just press Print Screen), then evaluate the following:

In[33]:= SetSystemOptions["PackedArrayOptions" -> "UnpackMessage" -> True];

In[34]:= AbsoluteTiming[nb = NotebookGet@ClipboardNotebook[];]
Out[34]= {0.4687500, Null}

In[35]:= AbsoluteTiming[d1 = nb[[1, 1, 1, 1, 1, 1, 1]];]
Out[35]= {0., Null}

In[36]:= AbsoluteTiming[d2 = First@Cases[nb, r_RasterBox :> First[r], Infinity, 1];]

During evaluation of In[36]:= Developer`FromPackedArray::unpack: Unpacking array in call to Notebook. >>

Out[36]= {0.9375000, Null}

(I did this on Windows; I am not sure whether the paste code is the same on other systems.)

Note that extracting the data using Cases is extremely slow compared to using Part directly, even though I explicitly tell Cases that I need only one match.

I did find out (as shown above) that Cases triggers unpacking for some reason, even though the search should stop before it reaches the packed array inside. Using a shallower level specification than Infinity might avoid unpacking.
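For anyone without a screenshot on the clipboard, a synthetic expression can stand in for nb. This is only a sketch: the nesting below is made up (the real clipboard notebook is deeper and carries more options), `expr` and `data` are hypothetical names, and whether an unpacking message appears may depend on the version. It puts the RasterBox at level 5, so `{5}` plays the role of the shallower level specification.

```mathematica
(* Hypothetical stand-in for the clipboard notebook: a packed array under a few inert heads *)
data = RandomReal[1, {200, 200}];
expr = Notebook[{Cell[BoxData[GraphicsBox[RasterBox[data]]]]}];

On["Packing"];  (* ask the kernel to report unpacking *)

(* Search all levels; this is the call that reported unpacking in the session above *)
deep = First@Cases[expr, r_RasterBox :> First[r], Infinity, 1];

(* Search only level 5, where RasterBox sits in this synthetic expression *)
shallow = First@Cases[expr, r_RasterBox :> First[r], {5}, 1];

Off["Packing"];
```

Both calls return the same data; the point of comparison is only whether the first one issues the unpack message while the second stays silent.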

Question: Using Cases here is both easier and more reliable than Part (what if the subexpression can appear in different positions?). Is there a way to make Cases fast here, perhaps by using a different pattern or different options?


Possibly related question: Mathematica's pattern matching poorly optimized? (This is why I changed the Cases rule from RasterBox[data_, ___] -> data to r_RasterBox :> First[r].)

Szabolcs

1 Answer


I don't have access to Mathematica right now, so what follows is untested. My guess is that Cases unpacks here because it searches depth-first, and so sees the packed array first. If this is correct, then you could use rules instead (ReplaceAll, not Replace), and throw an exception upon first match:

Module[{tag},
   Catch[
     nb /. r_RasterBox :> Block[{}, Throw[First[r], tag] /; True]; 
     $Failed, 
     tag]
]

As I said, this is just an untested guess.
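For a quick check that does not depend on the clipboard, here is a sketch on a synthetic expression. The nesting and the names `data`, `expr` and `found` are made up for illustration, and I use the With-based form of forcing evaluation of the right-hand side, which is a variation on the code above rather than the same code.

```mathematica
(* Hypothetical stand-in for nb: a packed array buried under inert heads *)
data = Developer`ToPackedArray@RandomReal[1, {500, 500}];
expr = Notebook[{Cell[BoxData[GraphicsBox[RasterBox[data]]]]}];

found = Module[{tag},
   Catch[
    (* ReplaceAll tests each expression before its parts, so the rule fires
       on RasterBox[...] before the array inside is ever scanned.
       The With initializer is evaluated at match time, so Throw exits at once. *)
    expr /. r_RasterBox :> With[{res = Throw[First[r], tag]}, res /; True];
    $Failed,  (* reached only when nothing matched *)
    tag]];

(* The extracted array should still be packed if no unpacking took place *)
Developer`PackedArrayQ[found]
```

If the expression contains no RasterBox at all, the Catch returns `$Failed` instead of the (possibly huge) original expression.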

Edit 2: an approach based on shielding parts of expression from the pattern-matcher

Preamble

The first edit (below) presents a rather heavy approach. In many cases one can take an alternative route. In this particular problem (and many others like it), the main task is to shield certain sub-expressions from the pattern-matcher. This can also be achieved with rules that temporarily replace the parts of interest with dummy symbols.

Code

Here is a modification of Cases which does just that:

Clear[casesShielded];
casesShielded[expr_,pt_,shieldPattern_,levspec_,n_,opts:OptionsPattern[]]:=
   Module[{dummy,inverseShieldingRules, shielded, i=0},
      inverseShieldingRules =
        If[#==={},#,Dispatch@First@#]&@
           Reap[shielded= expr/.(p:shieldPattern):>
             With[{eval = With[{ind = ++i},Sow[dummy[ind]:>p];dummy[ind]]},
                eval/;True];
           ][[2]];
      Cases[shielded,pt,levspec,n,opts]/.inverseShieldingRules]; 

This version of Cases has one additional parameter shieldPattern (third one), which indicates which sub-expressions must be shielded from the pattern-matcher.

Advantages and applicability

The code above is pretty light-weight (compared to the suggestion of edit1 below), and it allows one to fully reuse and leverage the existing Cases functionality. This will work for cases when the main pattern (or rule) is insensitive to shielding of the relevant parts, which is a rather common situation (and in particular, covers patterns of the type _h, including the case at hand). This may also be faster than the application of myCases (described below).

The case at hand

Here, we need this call:

In[55]:=    
(d4=First@casesShielded[nb,x_RasterBox:>First@x,
    p_List/;Developer`PackedArrayQ[p],Infinity,1]);//Timing

Out[55]= {0.,Null}

and the result is of course the same as before:

In[61]:= d2===d4
Out[61]= True

Edit: an alternative Cases-like function

Motivation and code

It took me a while to produce this function, and I am not 100 percent sure it always works correctly, but here is a version of Cases which, while still working depth-first, analyzes the expression as a whole before its sub-expressions:

ClearAll[myCases];
myCases[expr_, lhs_ :> rhs_, upToLevel_: 1, max : (_Integer | All) : All, 
    opts : OptionsPattern[]] :=
 Module[{tag, result, f, found = 0, aux},
   With[{
    mopts = FilterRules[{opts}, {Heads -> False}],
    frule =
       Apply[
         RuleDelayed,
         Hold[lhs, With[{eval =  aux}, Null /; True]] /.
            {aux :> Sow[rhs, tag] /; max === All, 
             aux :> (found++; Sow[rhs, tag])}
       ]
    },
    SetAttributes[f, HoldAllComplete];
    If[max =!= All,
       _f /; found >= max := Throw[Null, tag]
    ];
    f[x_, n_] /; n > upToLevel := Null;
    f[x_, n_] :=
      Replace[
       HoldComplete[x],
       {
          frule,
          ex : _[___] :> 
            With[{ev = 
              Replace[
                HoldComplete[ex],
                y_ :> With[{eval = f[y, n + 1]}, Null /; True],
                {2},
                Sequence @@ mopts
              ]}, 
              Null /; True
            ]
       },
       {1}
      ]
   ]; (* external With *)
   result = 
     If[# === {}, #, First@#] &@
        Reap[Catch[f[expr, 0], tag], tag, #2 &][[2]];
   (* For proper garbage-collection of f *)
   ClearAll[f]; 
   result
 ]

How it works

This is not the most trivial piece of code, so here are some remarks. This version of Cases is based on the same idea I suggested first: use rule-substitution semantics to attempt the pattern match on the entire expression first, and go to sub-expressions only if that fails. I stress that this is still a depth-first traversal, but different from the standard one (which is used in most expression-traversing functions like Map, Scan, Cases, etc.). I use Reap and Sow to collect the intermediate results (matches).

The trickiest part is preventing sub-expressions from evaluating, for which I had to wrap them in HoldComplete. Consequently, I had to use (a nested version of) the Trott-Strzebonski technique (perhaps there are simpler ways, but I wasn't able to see them) to enable evaluation of the rules' right-hand sides inside held (sub)expressions, and to use Replace with a proper level specification that accounts for the extra HoldComplete wrappers. I return Null in the rules, since the main action is to Sow the parts, so it does not matter what is injected into the original expression at the end.

Some extra complexity comes from supporting the level specification (only a single integer giving the maximal level to search, not the full range of possible level specifications), the maximal number of results, and the Heads option. The code for frule avoids the overhead of counting found elements when we want all of them. I use the same Module-generated tag both for Sow and for exceptions (which stop the process once enough matches have been found, just like in my original suggestion).
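The Trott-Strzebonski technique mentioned above can be seen in isolation in a minimal sketch. The symbols `f` and `g` are arbitrary placeholders with no definitions, and `naive` and `injected` are just names for the two results.

```mathematica
(* A plain delayed rule inserts its right-hand side unevaluated into held code *)
naive = Hold[f[1 + 1]] /. f[x_] :> g[x^2]
(* Hold[g[(1 + 1)^2]] *)

(* With[...] combined with a trivially true Condition evaluates the initializer
   at match time and injects the result, even inside Hold *)
injected = Hold[f[1 + 1]] /. f[x_] :> With[{eval = x^2}, g[eval] /; True]
(* Hold[g[4]] *)
```

In myCases the same construct is used with Null in place of `g[eval]`, because only the side effect of evaluating the initializer (the Sow, or the recursive call to f) matters.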

Tests and benchmarks

To have a non-trivial test of this functionality, we can for example find all symbols in the DownValues of myCases, and compare to Cases:

In[185]:= 
And@@Flatten[
    Outer[
       myCases[DownValues[myCases],s_Symbol:>Hold[s],#1,Heads->#2]  ===
       Cases[DownValues[myCases],s_Symbol:>Hold[s],#1,Heads->#2]&,
       Range[0,20],
       {True,False}
    ]]

Out[185]= True

The myCases function is about 20-30 times slower than Cases though:

In[186]:= 
Do[myCases[DownValues[myCases],s_Symbol:>Hold[s],20,Heads->True],{500}];//Timing
Out[186]= {3.188,Null}

In[187]:= Do[Cases[DownValues[myCases],s_Symbol:>Hold[s],20,Heads->True],{500}];//Timing
Out[187]= {0.125,Null}

The case at hand

It is easy to check that myCases solves the original problem of unpacking:

In[188]:= AbsoluteTiming[d3=First@myCases[nb,r_RasterBox:>First[r],Infinity,1];]
Out[188]= {0.0009766,Null}

In[189]:= d3===d2
Out[189]= True

It is hoped that myCases can be generally useful for situations like this, although the performance penalty of using it in place of Cases is substantial and has to be taken into account.

Leonid Shifrin
  • `Cases[{{{1}}}, _, Infinity]` returns `{1, {1}, {{1}}}`, supporting the depth-first hypothesis. Also, `On["Packing"]; Cases[z[Range[10]], _, {1}]` does not issue an unpacking warning but `On["Packing"]; Cases[z[Range[10]], _, {2}]` does. This suggests that `Cases` unconditionally unpacks arrays once it determines that it needs to scan them. +1 – WReach Jan 02 '12 at 17:02
  • Great Leonid! This works. I needed to add point-evaluation to make it work in my case (apparently something is holding the expression unevaluated), and I also made it return `$Failed` when the expression is not found (so it won't return the whole huge expression). I edited your post. – Szabolcs Jan 02 '12 at 17:11
  • @WReach I think `Cases[z[Range[10]], _, {2}]` *must* unpack because we're explicitly asking to search level 2. `Cases[{z[Range[10]]}, _z, 3, 1]` however doesn't need to because we're telling it that once it found a single match, it can return. When it finds it, it still hasn't reached the packed array, so it wouldn't in theory need to touch it. I guess there's some room for optimization for this (admittedly uncommon) situation. Do you agree? – Szabolcs Jan 02 '12 at 17:19
  • I think it's worth pointing out that once `ReplaceAll` has found a match, it won't search subexpressions of that match anymore, while `Cases` does. The last argument of `Cases` doesn't seem to prevent this. Now it makes sense to me, but it wasn't something that's very intuitive or easy to figure out ... – Szabolcs Jan 02 '12 at 17:22
  • @Szabolcs I agree that there is room for optimization in `Cases`. It is not immediately obvious to me that an array _must_ be unpacked to scan it -- but then again I don't know what's going on under the covers here. A breadth-first searching variant of `Cases` could be useful too. I was able to prevent the unpacking by limiting the search to level 6 or less, e.g. `First@Cases[nb, r_RasterBox :> First[r], {6}, 1]`, though that is obviously fragile. – WReach Jan 02 '12 at 17:34
  • @WReach The statement on depth-first nature of `Cases` wasn't a hypothesis - I knew it is this way with `Cases`. The hypothesis was that `Cases` unpacks unconditionally. This turned out to be true, and I also think that `Cases` could be optimized better for purely syntactic patterns. Regarding depth-first, there are many variations of it, and the one where the entire expression is analyzed before parts would suit us here, still being depth-first (and arguably, will better suit this problem than breadth-first version would). – Leonid Shifrin Jan 02 '12 at 20:20
  • @Szabolcs Well, I am happy this worked for you, and thanks for the accept. I added a more general solution for this sort of problems, based on the same idea - see if you like it more than the original suggestion (it may however be a road to madness, so be warned:)) – Leonid Shifrin Jan 02 '12 at 23:36
  • @Szabolcs There is another solution to this problem, which is based on a different idea of shielding parts of expression from the pattern-matcher. This one allows to more fully reuse the original `Cases` functionality. Generally, I think both methods are useful, since they generally cover different sets of use cases (although these sets do overlap). Please see my second edit for the code. – Leonid Shifrin Jan 03 '12 at 13:05