Finding runs of similar, not identical, elements in Mathematica

Question

I have a list of numbers. I want to extract from the list runs of numbers that fall inside some band and have some minimum length. For instance, suppose the I want to operate on this list:

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, 1.4, -0.3, -0.1, -0.7}

with band=1 and runLength=3. I would like to have

{{-0.6, -0.8, -0.1}, {-0.3, -0.1, -0.7}}

as the result. Right now I'm using

Cases[
 Partition[thisList,runLength,1],
 x_ /; Abs[Max[x] - Min[x]] < band
]

The main problem is that where runs overlap, I get many copies of the run. For instance, using

thisList = {-1.2, -1.8, 1.5, -0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}

gives me

{{-0.6, -0.8, -0.1}, {-0.8, -0.1, -0.5}, {-0.1, -0.5, -0.3}, {-0.5, -0.3, -0.1}, {-0.3, -0.1, -0.7}}

where I would rather have

{-0.6, -0.8, -0.1, -0.5, -0.3, -0.1, -0.7}

without doing some hokey reduction of the overlapping result. What's the proper way? It'd be nice if it didn't involve exploding the data using Partition.

Your definition of `band` seems a little incomplete. How does the run `{-0.6,-0.8,-0.1}` fit into `band = 1`? — rcollyer, Dec 13 '11 at 19:54
@rcollyer He defines it as `Abs[Max[x] - Min[x]] < band`. So `-0.1 + 0.8 < 1`, whereas including the element to the right or left will be `> 1`. — abcd, Dec 13 '11 at 19:57
@yoda, it seems I need to read the code at all before I comment ... — rcollyer, Dec 13 '11 at 19:58
Note that the question as it is now is still rather ambiguous. Consider a list `{1,1.5,2,2.4,2.6,2.9}`: there are two overlapping possibilities for the run length `3`and band `1.5`: `{1,1.5,2,2.4}` and `{1.5,2,2.4,2.6,2.9}`. So, which one do we pick? In my answer below, I used the convention that we pick the left-most one. One alternative formulation would be to pick the longest, but there are many other possibilities. — Leonid Shifrin, Dec 14 '11 at 08:07
Wow there is more subtlety in the task than I initially appreciated. Thanks to Leonid for such a comprehensive solution. — ArgentoSapiens, Dec 14 '11 at 14:52

score 6 · Accepted Answer · edited May 23 '17 at 12:00

EDIT

Apparenty, my first solution has at least two serious flaws: it is dead slow and completely impractical for lists larger than 100 elements, and it contains some bug(s) which I wasn't able to identify yet - it is missing some bands sometimes. So, I will provide two (hopefuly correct) and much more efficient alternatives, and I provide the flawed one below for any one interested.

A solution based on linked lists

Here is a solution based on linked lists. It allows us to still use patterns but avoid inefficiencies caused by patterns containing __ or ___ (when repeatedly applied):

ClearAll[toLinkedList];
toLinkedList[x_List] := Fold[{#2, #1} &, {}, Reverse@x]

ClearAll[accumF];
accumF[llFull_List, acc_List, {h_, t_List}, ctr_, max_, min_, band_, rLen_] :=
  With[{cmax = Max[max, h], cmin = Min[min, h]},
     accumF[llFull, {acc, h}, t, ctr + 1, cmax, cmin, band, rLen] /; 
        Abs[cmax - cmin] < band];
accumF[llFull_List, acc_List, ll : {h_, _List}, ctr_, _, _, band_, rLen_] /; ctr >= rLen :=
     accumF[ll, (Sow[acc]; {}), ll, 0, h, h, band, rLen];
accumF[llFull : {h_, t : {_, _List}}, _List, ll : {head_, _List}, _, _, _, band_, rLen_] :=
     accumF[t, {}, t, 0, First@t, First@t, band, rLen];
accumF[llFull_List, acc_List, {}, ctr_, _, _, _, rLen_] /; ctr >= rLen := Sow[acc];

ClearAll[getBandsLL];
getBandsLL[lst_List, runLength_Integer, band_?NumericQ] :=
  Block[{$IterationLimit = Infinity},
     With[{ll = toLinkedList@lst},
        Map[Flatten,
          If[# === {}, #, First@#] &@
            Reap[
              accumF[ll, {}, ll, 0, First@ll, First@ll, band,runLength]
            ][[2]]
        ]
     ]
  ];

Here are examples of use:

In[246]:= getBandsLL[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,1.4,-0.3,-0.1,-0.7},3,1]
Out[246]= {{-0.6,-0.8,-0.1},{-0.3,-0.1,-0.7}}

In[247]:= getBandsLL[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7},3,1]
Out[247]= {{-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7}}

The main idea of the function accumF is to traverse the number list (converted to a linked list prior to that), and accumulate a band in another linked list, which is passed to it as a second argument. Once the band condition fails, the accumulated band is memorized using Sow (if it was long enough), and the process starts over with the remaining part of the linked list. The ctr parameter may not be needed if we would choose to use Depth[acc] instead.

There are a few non-obvious things present in the above code. One subtle point is that an attempt to join the two middle rules for accumF into a single rule (they look very similar) and use CompoundExpression (something like (If[ctr>=rLen, Sow[acc];accumF[...])) on the r.h.s. would lead to a non-tail-recursive accumF (See this answer for a more detailed discussion of this issue. This is also why I make the (Sow[acc]; {}) line inside a function call - to avoid the top-level CompoundExpression on the r.h.s.). Another subtle point is that I have to maintain a copy of the linked list containing the remaining elements right after the last successful match was found, since in the case of unsuccessful sequence I need to roll back to that list minus its first element, and start over. This linked list is stored in the first argument of accumF.

Note that passing large linked lists does not cost much, since what is copied is only a first element (head) and a pointer to the rest (tail). This is the main reason why using linked lists vastly improves performance, as compared to the case of patterns like {___,x__,right___} - because in the latter case, a full sequences x or right are copied. With linked lists, we effectively only copy a few references, and therefore our algorithms behave roughly as we expect (linearly in the length of the data list here). In this answer, I also mentioned the use of linked lists in such cases as one of the techniques to optimize code (section 3.4).

Compile - based solution

Here is a straightforward but not too elegant function based on Compile, which finds a list of starting and ending bands positions in the list:

bandPositions = 
  Compile[{{lst, _Real, 1}, {runLength, _Integer}, {band, _Real}},
   Module[{i = 1, j, currentMin, currentMax, 
        startEndPos = Table[{0, 0}, {Length[lst]}], ctr = 0},
    For[i = 1, i <= Length[lst], i++,
      currentMin = currentMax = lst[[i]];
      For[j = i + 1, j <= Length[lst], j++,
        If[lst[[j]] < currentMin,
           currentMin = lst[[j]],
           (* else *)
           If[lst[[j]] > currentMax,
             currentMax = lst[[j]]
           ]
        ];
        If[Abs[currentMax - currentMin] >= band ,
          If[ j - i >= runLength,
             startEndPos[[++ctr]] = {i, j - 1}; i = j - 1
          ];
          Break[],
          (* else *)
          If[j == Length[lst] && j - i >= runLength - 1,
              startEndPos[[++ctr]] = {i, j}; i = Length[lst];
              Break[];
          ];
        ]
      ]; (* inner For *)
    ]; (* outer For *)
    Take[startEndPos, ctr]], CompilationTarget -> "C"];

This is used in the final function:

getBandsC[lst_List, runLength_Integer, band_?NumericQ] :=
   Map[Take[lst, #] &, bandPositions[lst, runLength, band]]

Examples of use:

In[305]:= getBandsC[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,1.4,-0.3,-0.1,-0.7},3,1]
Out[305]= {{-0.6,-0.8,-0.1},{-0.3,-0.1,-0.7}}

In[306]:= getBandsC[{-1.2,-1.8,1.5,-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7},3,1]
Out[306]= {{-0.6,-0.8,-0.1,-0.5,-0.3,-0.1,-0.7}}

Benchmarks

In[381]:= 
largeTest  = RandomReal[{-5,5},50000];
(res1 =getBandsLL[largeTest,3,1]);//Timing
(res2 =getBandsC[largeTest,3,1]);//Timing
res1==res2

Out[382]= {1.109,Null}
Out[383]= {0.016,Null}
Out[384]= True

Obviously, if one wants performance, Compile wins hands down. My observations for larger lists are that both solutions have approximately linear complexity with the size of the number list (as they should), with compiled one roughly 150 times faster on my machine than the one based on linked lists.

Remarks

In fact, both methods encode the same algorithm, although this may not be obvious. The one with recursion and patterns is arguably somewhat more understandable, but that is a matter of opinion.

A simple, but slow and buggy version

Here is the original code that I wrote first to solve this problem. This is based on a rather straightforward use of patterns and repeated rule application. As mentioned, one disadvantage of this method is its very bad performance. This is actually another case against using constructs like {___,x__,y___} in conjunction with repeated rule application, for anything longer than a few dozens elements. In the mentioned recommendations for code optimization techniques, this corresponds to the section 4.1.

Anyways, here is the code:

If[# === {}, #, First@#] &@
 Reap[thisList //. {
    left___, 
    Longest[x__] /;Length[{x}] >= runLength && Abs[Max[{x}] - Min[{x}]] < band,
    right___} :> (Sow[{x}]; {right})][[2]]

It works correctly for both of the original small test lists. It also looks generally correct, but for larger lists it often misses some bands, which can be seen by comparison with the other two methods. I wasn't so far able to localize the bug, since the code seems pretty transparent.

Leonid, I cannot recall seeing `f[...] := With[{...}, f[...] /; ...]` before. I frankly didn't think that would work. Is this explained in your book or elsewhere? — Mr.Wizard, Dec 14 '11 at 08:14
@Mr.Wizard No, it is not in the book (the book is very basic :)), but this is a standard semantics of variables shared between the body and the condition, which we discussed many times here on SO (let me know if you need references to past discussions using this). The fact that the r.h.s. contains a call to another `f` is irrelevant. What is IMO more interesting (or did you mean this one?) is that the call to `f` wrapped in `With` still keeps `f` tail-recursive in Mathematica sense - this must be related to the way scoping constructs are evaluated (contrast this with `CompoundExpression`). — Leonid Shifrin, Dec 14 '11 at 08:20
I guess this is a natural extension of the "Trott-Strzebonski" method, but I still haven't seen it used like that before. I have yet so much to learn. — Mr.Wizard, Dec 14 '11 at 08:21
I meant the simple case. I am still trying to understand tail-recursion in Mathematica. — Mr.Wizard, Dec 14 '11 at 08:27
@Mr.Wizard No, it is not an extension, Trott - Strzebonski is just one application of it - it is the language construct that *makes* Trott - Strzebonski possible. Ok, let me give some refs. Apparently, not so much of this stuff here at SO, but some on MathGroup: http://groups.google.com/group/comp.soft-sys.math.mathematica/browse_thread/thread/499a037dc0cc5773, and also in some of my posts here: http://groups.google.com/group/comp.soft-sys.math.mathematica/browse_thread/thread/81da38250f225b95 — Leonid Shifrin, Dec 14 '11 at 08:33
Leonid, this time I understand that; the reason for my second comment above is that in the past I was (subconsciously of course) seeing Trott-Strzebonski as "magic" and I had never internalized it. When you wrote "this is a standard semantics of variables shared between the body and the condition" it jogged my memory, and then I finally understood what I thought I already knew. :-) (ps thanks for the links) — Mr.Wizard, Dec 14 '11 at 08:44
@Mr.Wizard I actually found a ref from the past SO discussion - http://stackoverflow.com/questions/6560116/best-practices-in-error-reporting-mathematica/6563886#6563886, I mention this stuff there in a sub-section and briefly explain what it is. As to the Trott - Strzebonski method, I still feel it is a bit magical, even knowing the shared variables semantics. It becomes perhaps less magical if you use a version with `RuleCondition`, described by WReach in the same post devoted to Trott - Strzebonski technique. — Leonid Shifrin, Dec 14 '11 at 09:22
Now I confess that I simplified the problem for easier asking and I'm worried my simplification affects the answer. The actual input is a list of pairs, as `{{1,-1.2},{2,-1.8},{3,1.5}}` and while I'm searching for runs where the second element falls inside a band, the output I seek is the list of first elements corresponding to those runs. I'm sorry to request advice beyond your excellent solution, but is it possible to use your linked list approach with lists of pairs? — ArgentoSapiens, Dec 14 '11 at 16:08
@ArgentoSapiens It is much easier to use the compiled version, since it returns a list of positions. You could write a function `getFirstELements[pairs_List, runLength_Integer, band_?NumericQ] := Map[Take[pairs[[All,1]], #] &, bandPositions[pairs[[All,2]], runLength, band]]`, and that is the only change you need. For linked lists, it is possible but a bit harder, particularly because we can no longer use lists as nesting means when the elements are lists themselves. This is surely doable though. Let me know if my suggestion for the compiled solution works for you. — Leonid Shifrin, Dec 14 '11 at 16:31

score 2 · Answer 2 · answered Dec 13 '11 at 21:26

I'd try this instead:

thisList /. {___, Longest[a : Repeated[_, {3, Infinity}]], ___} :> 
               {a} /; Abs[Max@{a} - Min@{a}] < 1

where Repeated[_, {3, Infinity}] guarantees that you get at least 3 terms, and Longest ensures it gives you the longest run possible. As a function,

Clear[f]
f[list_List, band_, minlen_Integer?Positive] := f[list, band, minlen, Infinity]
f[list_List, band_, 
  minlen_Integer?Positive, maxlen_?Positive] /; maxlen >= minlen := 
 list /. {___, Longest[a : Repeated[_, {minlen, maxlen}]], ___} :> {a} /; 
    Abs[Max@{a} - Min@{a}] < band

This seems to only pick one possible band out of (generally) several present in a list. — Leonid Shifrin, Dec 14 '11 at 08:16

score 0 · Answer 3 · answered Dec 13 '11 at 22:35

Very complex answers given. :-) I think I have a simpler approach for you. Define to yourself what similarity means to you, and use GatherBy[] to collect all similar elements, or SplitBy[] to collect "runs" of similar elements (then remove "runs" of minimal unaccepted length, say 1 or 2, via DeleteCases[]).

Harder question is establishing similarity. By your method 1.2,0.9,1.9,0.8 would group first three elements, but not last three, but 0.9 and 0.8 are just as similar, and 1.9 would kick it out of your band. What about .4,.5,.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5 - where does similarity end?

Also look into ClusteringComponents[] and FindClusters[]

This won't work since this is not a transitive similarity - it can not be defined as a pairwise similarity / comparison function. — Leonid Shifrin, Dec 13 '11 at 22:53