
I'm looking for something like #'delete-duplicates, but I know that all elements of the list are already sorted, or inversely sorted, or at least arranged so that duplicates will already be adjacent to each other. I wish to use that knowledge to ensure that execution speed is not proportional to the square of the number of elements in the list. It's trivial to use #'maplist to grow my own solution, but is there something already in the language? It would be embarrassing to reinvent the wheel.
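
(For reference, here is a minimal sketch of the kind of hand-rolled solution alluded to above. It is illustrative only, not code from the post, and the name remove-adjacent-dups is made up; it uses mapcon, the splicing relative of maplist, keeping an element only when it differs from its successor, so the last element of each run of adjacent duplicates survives, in a single pass.)

(defun remove-adjacent-dups (list &key (test #'eql))
  ;; MAPCON applies the function to successive tails and NCONCs the
  ;; results; returning NIL for an element drops it from the spliced result.
  (mapcon (lambda (tail)
            (if (or (null (cdr tail))
                    (not (funcall test (first tail) (second tail))))
                (list (first tail))
                '()))
          list))

;; (remove-adjacent-dups '(1 1 2 2 2 3 1 1)) => (1 2 3 1)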

To be clear, for largish lengths of lists, I would like the running time of the deletion to be proportional to the length of the list, not proportional to the square of that length. This is the behavior I wish to avoid:

(defun one-shot (cardinality)
  (labels ((generate-list (the-count)
             (let* ((the-list (make-list the-count)))
               (do ((iterator 0 (1+ iterator)))
                 ((>= iterator the-count))
                 (setf (nth iterator the-list) iterator))
               the-list)))
    (let* ((given-list (generate-list cardinality))
           (stripped-list)
           (start-time)
           (end-time))
      (setf start-time (get-universal-time))
      (setf stripped-list (delete-duplicates given-list :test #'eql))
      (setf end-time (get-universal-time))
      (princ "for n = ")
      (princ cardinality)
      (princ ", #'delete-duplicates took ")
      (princ (- end-time start-time))
      (princ " seconds")
      (terpri))))
(one-shot 20000)
(one-shot 40000)
(one-shot 80000)
for n = 20000, #'delete-duplicates took 6 seconds
for n = 40000, #'delete-duplicates took 24 seconds
for n = 80000, #'delete-duplicates took 95 seconds
Bill Evans at Mariposa
  • I don't think there's anything like this in the language, but this is pretty much [run-length encoding](https://en.wikipedia.org/wiki/Run-length_encoding) without keeping the number of occurrences. Also, why not just use the [`time`](http://www.lispworks.com/documentation/HyperSpec/Body/m_time.htm) macro to do your timing (after being sure to compile your function)? – Joshua Taylor Nov 04 '13 at 19:02
  • Joshua, you're correct on all counts, though I wanted to keep the output tidier than `time` does. – Bill Evans at Mariposa Nov 04 '13 at 19:12
  • You are using a really bad way to create a list of numbers. Your sub-function GENERATE-LIST is using at least three big anti-patterns in Lisp. – Rainer Joswig Nov 04 '13 at 19:51
  • Rainer, thanks for the heads up. Would you be so kind as to explain the first three anti-patterns you see? – Bill Evans at Mariposa Nov 04 '13 at 21:36
  • You make a list of NILs - 1st loop. Then you count up in a separate loop - 2nd loop. Then you set the nth item in a list - another loop for each item. Just that traverses the list 1/2 N^2 times. **Anti patterns**: 1) multiple loops. 2) first list gets created and then the contents are set. 3) repeated access of the nth element. **Better**: 1) one loop. 2) during list creation set the items. 3) avoid NTH and just use one loop, here the initial. **Keep in mind**: Lists are singly linked cons cells. Lists are not arrays. – Rainer Joswig Nov 05 '13 at 12:01
  • It's also worth pointing out that the list that's being constructed isn't a good test case for `delete-duplicates`. An implementation could validly make one pass through the list building up a hash table of elements to check whether there are any duplicated elements and then, if there are not, simply return the list. That would be linear in the length of the list, and correct. Your test list has no duplicate elements, so it's not really testing the main functionality of `delete-duplicates`. – Joshua Taylor Nov 05 '13 at 12:55
  • Joshua, this is true. The original question, though, was whether there was a feature of the language that was known to take advantage of all duplicates already being clustered together; and the timing numbers, by demonstrating that doubling the length of the list quadrupled the time, illustrated what I was trying to avoid. And in the real life application, I have abandoned the list in favor of a hash. – Bill Evans at Mariposa Nov 06 '13 at 01:50
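
As a hedged aside (a sketch written for this page, not code from the thread): the single-pass construction Rainer Joswig describes a few comments up amounts to setting each item as the list is built, in one loop and with no nth, which is also the idiom his answer below uses:

(defun generate-list (n)
  ;; one loop; each item is produced as its cons cell is created
  (loop for i below n collect i))

;; or, without LOOP:
(defun generate-list* (n)
  (do ((i 0 (1+ i))
       (acc '() (cons i acc)))
      ((>= i n) (nreverse acc))))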

5 Answers


There's nothing like this in the language, but something along the following lines makes just one pass through the list:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (loop
     for head = list then (cdr head)
     until (endp head)
     finally (return list)
     do (setf (cdr head)
              (member (if (null key) (car head)
                          (funcall key (car head)))
                      (cdr head)
                      :key key :test-not test))))

As @wvxvw pointed out, it might be possible to simplify this iteration using (loop for head on list finally (return list) do ...). However, 3.6 Traversal Rules and Side Effects says that modifying the cdr chain of a list during an object traversal leads to undefined behavior, and it's not clear whether loop for head on list is technically an object-traversal operation or not. The documentation for loop, in 6.1.2.1.3 The for-as-on-list subclause, says that

In the for-as-on-list subclause, the for or as construct iterates over a list. … The variable var is bound to the successive tails of the list in form1. At the end of each iteration, the function step-fun is applied to the list; the default value for step-fun is cdr. … The for or as construct causes termination when the end of the list is reached.

This says that the step function is always applied at the end of the iteration, so it sounds like loop for head on list should be OK. At any rate, any possible issues can be avoided by using a do loop instead:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (do ((head list (cdr head)))
      ((endp head) list)
    (setf (cdr head)
          (member (if (null key) (car head)
                      (funcall key (car head)))
                  (cdr head)
                  :key key :test-not test))))

The idea is to start with head being the list, then set its cdr to the first tail that starts with a different element, then advance the head, and continue until there's nothing left. This should be linear in the length of the list, assuming that member is implemented in a sensible way. The use of member means that you don't have to do any extra work to get :key and :test working in the appropriate way. (Do note that the :test of delete-adjacent-duplicates is passed as member's :test-not.) Note: there's actually a slight issue with this, in that the key function will be called twice for each element in the final list: once when it's the first element of a tail, and once when it's the car of head.

CL-USER> (delete-adjacent-duplicates (list 1 1 1 1 2 2 3 3 3))
(1 2 3)
CL-USER> (delete-adjacent-duplicates (list 1 2 2))
(1 2)
CL-USER> (delete-adjacent-duplicates (list 1 3 5 6 4 2 3 5) :key 'evenp)
(1 6 3)

I expect that any linear time solution is going to take a similar approach: hold a reference to the current head, find the next tail that begins with a different element, and then make that tail the cdr of the head.

Joshua Taylor
  • Wouldn't `for head = list then (cdr head)` be the same as `for head on list`? –  Nov 04 '13 at 23:30
  • @wvxvw I don't think so, because we're modifying the structure of `list` during the execution of the loop (through `head`). `for head on list` might be permitted to do something “clever” like grabbing the cell for the next iteration _before_ executing the `do` part, but we actually want to be sure to modify the structure, and _then_ take the `cdr`. I think that [3.6 Traversal Rules and Side Effects](http://www.lispworks.com/documentation/HyperSpec/Body/03_f.htm) may be relevant here. – Joshua Taylor Nov 05 '13 at 00:48
  • @wvxvw Then again, [6.1.2.1.3 The for-as-on-list subclause](http://www.lispworks.com/documentation/HyperSpec/Body/06_abac.htm): says, “The variable var is bound to the successive tails of the list in form1. At the end of each iteration, the function step-fun is applied to the list; the default value for step-fun is cdr.” That sounds like `for head on list` is OK. I'm not sure which of this and “3.6 Traversal Rules and Side Effects” applies more. It might depend on whether `for-as-on-list` is considered an object-traversal form or not. – Joshua Taylor Nov 05 '13 at 00:52
  • I never researched it in depth, but I used it with destructive tree traversal functions and I don't remember problems doing that, but I really only used SBCL to do it. –  Nov 05 '13 at 09:53
  • I never understood the appeal of `loop`... Anyway, (re RLE) `map head . group` does the same, under lazy eval (except that it makes new structure, of course). – Will Ness Nov 05 '13 at 20:42
  • @WillNess I think some of `loop`'s appeal is that it makes it easy to get things done quickly, and in readable fashion. It's sometimes more readable than `do`, and it makes it easier to mix various kinds of iteration. That said, I think it's usually worthwhile asking, after writing a `loop`, “OK, I've got something that works, now can this be simplified?” – Joshua Taylor Nov 05 '13 at 20:59

I would expect REMOVE-DUPLICATES to have a linear time implementation. (And indeed it does* on my local SBCL install.)

Note that REMOVE-DUPLICATES and DELETE-DUPLICATES are specified to have the same return value, and that the side effects of DELETE-DUPLICATES are not guaranteed.

* The linear time code path is only taken when the :test is #'eq, #'eql, #'equal, or #'equalp (it relies on a hash table) and there is no :key or :test-not argument supplied.
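
For illustration, here is a hedged sketch of how such a hash-table code path can run in linear time. This is not SBCL's actual code, and the name remove-duplicates-hashed is made up; it keeps the last occurrence of each element, matching remove-duplicates' default behavior when :from-end is not supplied:

(defun remove-duplicates-hashed (list &key (test #'eql))
  ;; First pass: count occurrences. Second pass: keep an element only when
  ;; it is the last remaining occurrence. Both passes are O(n).
  (let ((counts (make-hash-table :test test)))
    (dolist (x list)
      (incf (gethash x counts 0)))
    (loop for x in list
          when (= 1 (gethash x counts)) collect x
          do (decf (gethash x counts)))))

Since make-hash-table only accepts eq, eql, equal, and equalp as tests, this also suggests why the fast path is restricted to those tests and to calls without :key or :test-not.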

m-n
  • it can have at best `O(n log n)`, and only if the argument list's elements *can be ordered* by some criterion `<` such that `not (a < b) AND not (b < a)` is equivalent to `(a == b)`. – Will Ness Nov 05 '13 at 11:13
  • @WillNess no, it doesn't have to be `O(n log n)`. SBCL uses a hash table to store all elements seen so far and adds / removes the next element based on what's in the hash table. So it's `O(n)`, though it has a large constant to go with it. –  Nov 05 '13 at 12:36
  • @wvxvw I deleted my comment here comparing the speed to Joshua Taylor's do based delete-adjacent-duplicates. In its current form his implementation is spending most of its time looking up 'eql, and if you replace that with #'eql it does indeed have a significantly better constant than remove-duplicates. – m-n Nov 06 '13 at 06:19
  • @wvxvw guys, hash tabling is only possible when comparing by `eql` and the like (where equality test forms an equivalence relation perhaps); is *this* linear in `n`: `(defun tt (n) (length (remove-duplicates (loop for i from 2 to n collect i) :test #'(lambda(a b)(> (gcd a b) 1)) :from-end t)))`? And if so, does `(tt 541)` evaluate to `100`? – Will Ness Nov 06 '13 at 09:36
  • @WillNess I'm glad you pointed that out. It uses a hash table only when the test is specified to work with hash tables (#'eq, #'eql, #'equal, #'equalp), and so doesn't give a linear time algorithm for generating a list of the primes. It also doesn't use the linear time code path when they key argument or test-not argument are supplied. – m-n Nov 06 '13 at 10:49
  • @m-n right so the adjacents removal has its place too. :) It'll always be linear. – Will Ness Nov 06 '13 at 10:51
  • Quite interesting though to note; might want to edit it into your answer for future users to see. – Sim Nov 07 '13 at 20:00

There is nothing like that in the language standard. However, you can do that either with a loop:

(defun remove-adjacent-duplicates (list &key (test #'eql))
  (loop for obj in list 
        and prev = nil then obj 
        for take = t then (not (funcall test obj prev))
        when take collect obj))

or with reduce (exercise left to the reader).
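
For what it's worth, here is one way that reduce version might look (a sketch for this page, not the answerer's code); it is non-destructive and builds the result back to front:

(defun remove-adjacent-duplicates/reduce (list &key (test #'eql))
  ;; With :FROM-END the reducing function receives (element accumulator),
  ;; so an element is consed on only when it differs from the element
  ;; already at the front of the accumulator.
  (reduce (lambda (x acc)
            (if (and acc (funcall test x (first acc)))
                acc
                (cons x acc)))
          list
          :from-end t
          :initial-value '()))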

See the other answer for a destructive implementation.

PS. Unless you are doing something tricky with timing, you are much better off using `time`.

sds
  • Do you have a particular `reduce` based implementation in mind? It's easy to _remove_ duplicates with `reduce`, but an efficient `delete` that reuses list structure seems particularly difficult since the function passed to `reduce` would get the _elements_ of the list rather than the cons cells that are the ‘spine’ of the list. – Joshua Taylor Nov 04 '13 at 20:19
  • @JoshuaTaylor: sorry, I meant a non-destructive remove. – sds Nov 04 '13 at 20:28

For the record: your test code is basically just this:

(defun one-shot (n &aux (list (loop for i below n collect i)))
  (time (delete-duplicates list))
  (values))

It might also be useful to talk to the implementation maintainers in the case of a slow delete-duplicates.

For example, `(one-shot 1000000)` runs in a second in CCL on my Mac. In LispWorks it runs in 0.155 seconds.

Rainer Joswig
  • But even if `delete-duplicates` is fast in this case (and note that there are no duplicates in this case; it could be doing a single pass over the list, recognizing that, and returning the list), this doesn't address whether its performance is generally _linear_ with the length of the list or not. The OP's task can be performed in linear time, but (i) `delete-duplicates` isn't guaranteed to be linear, and (ii) `delete-duplicates` doesn't solve the OP's problem of removing only adjacent duplicates, e.g., in the case of a list like `(1 1 2 1 1)` where the output should be `(1 2 1)`. – Joshua Taylor Nov 04 '13 at 22:27
  • Rainer, your timing figures are intriguing. I'd hate to admit what machine I'm running this on. But my interest was not primarily in how fast it runs, but how it scales: does doubling the length of the list double the time (good), or quadruple the time (bad)? Joshua rightly emphasizes this point. – Bill Evans at Mariposa Nov 04 '13 at 22:44
  • @BillEvansatMariposa cf. http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth – Will Ness Nov 05 '13 at 11:08
  • @Will Ness: My point exactly, but wikipedia expresses it more elegantly. – Bill Evans at Mariposa Nov 05 '13 at 11:28
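
As a worked illustration of that empirical-orders-of-growth idea (added here, not part of the thread): the exponent can be estimated as log(t2/t1)/log(n2/n1). For the timings in the question it comes out near 2 (quadratic); for the roughly-doubling timings reported elsewhere on this page it comes out near 1 (linear).

(defun empirical-order (n1 t1 n2 t2)
  ;; slope of log(time) against log(n): ~2 means quadratic, ~1 means linear
  (/ (log (/ t2 t1)) (log (/ n2 n1))))

;; (empirical-order 20000 6 40000 24)  => 2.0   ; the question's timings
;; (empirical-order 40000 24 80000 95) => ~1.98 ; still roughly quadratic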

A slightly different approach:

(defun compress-duplicates (list &key (test #'eql))
  ;; HEAD points at the last cell kept so far; TAIL scans the rest of the
  ;; list. Whenever TAIL starts with a different element, HEAD is linked
  ;; to it and advanced, so runs of duplicates are spliced out in place.
  (labels ((%compress-duplicates (head tail)
             (if (null tail)
                 (setf (cdr head) tail)
                 (progn (unless (funcall test (car head) (car tail))
                          (setf (cdr head) tail head (cdr head)))
                        (%compress-duplicates head (cdr tail))))))
    (when list ; guard the empty list: (cdr nil) cannot be assigned
      (%compress-duplicates list (cdr list)))
    list))
                  
(compress-duplicates (list 1 1 1 2 2 3 4 4 1 1 1))
;; (1 2 3 4 1)

Test of SBCL delete-duplicates implementation:

(defun test-delete-duplicates ()
  (labels ((%test (list)
             (gc)   ; implementation-provided GC (e.g. SB-EXT:GC in SBCL)
             (time (delete-duplicates list))))
    ;; Start from 1000 random elements and double the list on each of the
    ;; six iterations, timing DELETE-DUPLICATES on a fresh copy every time.
    (loop
       :repeat 6
       :for list := (loop :for i :from 0 :below 1000
                          :collect (random 100))
       :then (append list list) :do (%test (copy-list list)))))

;; (test-delete-duplicates)

;; Evaluation took:
;;   0.002 seconds of real time
;;   0.002000 seconds of total run time (0.002000 user, 0.000000 system)
;;   100.00% CPU
;;   3,103,936 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.003000 seconds of total run time (0.003000 user, 0.000000 system)
;;   100.00% CPU
;;   6,347,431 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.006 seconds of real time
;;   0.006000 seconds of total run time (0.005000 user, 0.001000 system)
;;   100.00% CPU
;;   12,909,947 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.012 seconds of real time
;;   0.012000 seconds of total run time (0.012000 user, 0.000000 system)
;;   100.00% CPU
;;   25,253,024 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.023 seconds of real time
;;   0.022000 seconds of total run time (0.022000 user, 0.000000 system)
;;   95.65% CPU
;;   50,716,442 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.049 seconds of real time
;;   0.050000 seconds of total run time (0.050000 user, 0.000000 system)
;;   102.04% CPU
;;   106,747,876 processor cycles
;;   0 bytes consed

The timings grow roughly linearly: each doubling of the list roughly doubles the time.


Test of ECL delete-duplicates implementation:

;; (test-delete-duplicates)
;; real time : 0.003 secs
;; run time  : 0.003 secs
;; gc count  : 1 times
;; consed    : 95796160 bytes
;; real time : 0.007 secs
;; run time  : 0.006 secs
;; gc count  : 1 times
;; consed    : 95874304 bytes
;; real time : 0.014 secs
;; run time  : 0.014 secs
;; gc count  : 1 times
;; consed    : 95989920 bytes
;; real time : 0.028 secs
;; run time  : 0.027 secs
;; gc count  : 1 times
;; consed    : 96207136 bytes
;; real time : 0.058 secs
;; run time  : 0.058 secs
;; gc count  : 1 times
;; consed    : 96617536 bytes
;; real time : 0.120 secs
;; run time  : 0.120 secs
;; gc count  : 1 times
;; consed    : 97412352 bytes

The time increases roughly linearly here as well.

  • Don't go modifying literal data, now; surely you meant `(compress-duplicates (list 1 1 1 1 … 5))`. ;) – Joshua Taylor Nov 05 '13 at 00:59
  • @JoshuaTaylor ah yes, alright, eventually it works like this too when tested in REPL, but I'll change it to avoid needless misunderstanding. –  Nov 05 '13 at 09:55
  • @ why not simply `(%compress-duplicates list (cdr list)) list` as body of `labels`? – Will Ness Nov 05 '13 at 18:42
  • @WillNess `%compress-duplicates` returns the end of the list, so that wouldn't work, but if I carried over the head of the list it could, but it seemed easier to just capture the head of the list and then return it rather than just pass it through to the end w/o changing it. –  Nov 05 '13 at 18:59
  • But there's `list` for that after the call. You don't need to capture `list`, it's already there. – Will Ness Nov 05 '13 at 19:03
  • that's the classic top-down list processor you wrote there: loop to do stuff for its effect, then return the changed object. (there's no tail-rec guarantee in CL though...). – Will Ness Nov 05 '13 at 20:45