5

Let's say I have a huge lazy seq and I want to iterate it so I can process on the data that I get during the iteration.

The thing is I want to lose head(GC'd) of lazy seq(that processed) so I can work on seqs that have millions of data without having OutofMemoryException.

I have 3 examples that I'm not sure.

Could you provide best practices(examples) for that purpose?

Do these functions lose head?

Example 1

(defn lose-head-fn
  [lazy-seq-coll]
  (when (seq (take 1 lazy-seq-coll))
    (do
      ;;do some processing...(take 10000 lazy-seq-coll)
      (recur (drop 10000 lazy-seq-coll)))))

Example 2

(defn lose-head-fn
  [lazy-seq-coll]
  (loop [i lazy-seq-coll]
    (when (seq (take 1 i))
      (do
        ;;do some processing...(take 10000 i)
        (recur (drop 10000 i))))))

Example 3

(doseq [i lazy-seq-coll]
  ;;do some processing...
  )

Update: Also there is an explanation in this answer here

Ertuğrul Çetin
  • 5,131
  • 5
  • 37
  • 76
  • 2
    As far as i know, all of the above would lose head (first two are obvious, since you manually drop the head, while `doseq`'s doc claims that it doesn't retain head). So if the `lazy-seq-coll` you pass to the function isn't bound somewhere else with `def` or `let` and used later, there should be nothing to worry about. So `(lose-head-fn (range))` won't eat all your memory, while `(def r (range)) (lose-head-fn r)` probably would – leetwinski Jun 16 '17 at 12:40
  • and the only best practive i could think of, is not to `def` possibly infinite (or just huge) sequences, because all of their realized items would live forever in the var – leetwinski Jun 16 '17 at 12:44
  • 1
    @leetwinski This is exactly right. I hope you will post it as an answer so there will be something contradicting the other answer, which is currently wrong. – amalloy Jun 16 '17 at 16:49
  • @leetwinski yes you should post it as an answer with additional information so new comers can understand! – Ertuğrul Çetin Jun 16 '17 at 17:57
  • thanks! I wasn't 100% sure – leetwinski Jun 16 '17 at 20:11

2 Answers2

5

copy of my above comments

As far as I know, all of the above would lose head (first two are obvious, since you manually drop the head, while doseq's doc claims that it doesn't retain head).

That means that if the lazy-seq-coll you pass to the function isn't bound somewhere else with def or let and used later, there should be nothing to worry about. So (lose-head-fn (range)) won't eat all your memory, while

(def r (range)) 

(lose-head-fn r) 

probably would.

And the only best practice I could think of is not to def possibly infinite (or just huge) sequences, because all of their realized items would live forever in the var.

Thumbnail
  • 13,293
  • 2
  • 29
  • 37
leetwinski
  • 17,408
  • 2
  • 18
  • 42
  • 1
    See the discussion called [*Don't hang (onto) your head*](https://clojure.org/reference/lazy#_don_t_hang_onto_your_head) in the article [*Making Clojure Lazier*](https://clojure.org/reference/lazy). I think it confirms your conclusions. – Thumbnail Jun 17 '17 at 06:49
5

In general, you must be careful not to retain a reference either locally or globally for a part of a lazy seq that precedes another which involves excessive computation.

For example:

(let [nums (range)
      first-ten (take 10 nums)]
  (+ (last first-ten) (nth nums 100000000)))
=> 100000009

This takes about 2 seconds on a modern machine. How about this though? The difference is the last line, where the order of arguments to + is swapped:

;; Don't do this!
(let [nums (range)
      first-ten (take 10 nums)]
  (+ (nth nums 100000000) (last first-ten)))

You'll hear your chassis/cpu fans come to life, and if you're running htop or similar, you'll see memory usage grow rather quickly (about 1G in the first several seconds for me).

What's going on?

Much like a linked list, elements in a lazy seq in clojure reference the portion of the seq that comes next. In the second example above, first-ten is needed for the second argument to +. Thus, even though nth is happy to hold no references to anything (after all, it's just finding an index in a long list), first-ten refers to a portion of the sequence that, as stated above, must hold onto references to the rest of the sequence.

The first example, by contrast, computes (last first-ten), and after this, first-ten is no longer used. Now the only reference to any portion of the lazy sequence is nums. As nth does its work, each portion of the list that it's finished with is no longer needed, and since nothing else refers to the list in this block, as nth walks the list, the memory taken by the sequence that has been examined can be garbage collected.

Consider this:

;; Don't do this!
(let [nums (range)]
  (time (nth nums 1e8))
  (time (nth nums 1e8)))

Why does this have a similar result as the second example above? Because the sequence will be cached (held in memory) on the first realization of it (the first (time (nth nums 1e8))), because nums is being used on the next line. If, instead, we use a different sequence for the second nth, then there is no need to cache the first one, so it can be discarded as it's processed:

(let [nums (range)]
  (time (nth nums 1e8))
  (time (nth (range) 1e8)))
"Elapsed time: 2127.814253 msecs"
"Elapsed time: 2042.608043 msecs"

So as you work with large lazy seqs, consider whether anything is still pointing to the list, and if anything is (global vars being a common one), then it will be held in memory.

Josh
  • 4,726
  • 2
  • 20
  • 32