4

I'm trying to compute folder size in parallel. Maybe it's naive approach. What I do, is that I give computation of every branch node (directory) to an agent. All leaf nodes have their file sizes added to my-size. Well it doesn't work. :)

'scan' works ok, serially. 'pscan' prints only files from first level.

(def agents (atom []))
(def my-size (atom 0))
(def root-dir (clojure.java.io/file "/"))

(defn scan [listing]
  (doseq [f listing]
    (if (.isDirectory f)
      (scan (.listFiles f))
      (swap! my-size #(+ % (.length f))))))

(defn pscan [listing]
  (doseq [f listing]
    (if (.isDirectory f)
      (let [a (agent (.listFiles f))]
        (do (swap! agents #(conj % a))
            (send-off a pscan)
            (println (.getName f))))
    (swap! my-size #(+ %  (.length f))))))

Do you have any idea, what have i done wrong?

Thanks.

jppalencar
  • 178
  • 1
  • 10

2 Answers2

15

No need to keep state using atoms. Pure functional:

(defn psize [f]
  (if (.isDirectory f)
    (apply + (pmap psize (.listFiles f)))
    (.length f)))
Jürgen Hötzel
  • 18,997
  • 3
  • 42
  • 58
  • 2
    Newbie is a permanent state of mind when doing lisp/clojure ;) Thank you. – jppalencar May 03 '11 at 16:10
  • 1
    ürgenHötzel Well, it was a bit inspiring. But when you read clojure/docs you will find that 'pmap' is _"...Only useful for computationally intensive functions where the time of f dominates the coordination overhead."_ Solution of computing folder size, from my point of view is more a **producer/consumer problem**. When producers slice the root directory in to branch nodes and consumers count file sizes from these branches. It's more of setting up right chunksize of problem. I got inspired by another [post](http://stackoverflow.com/questions/2760017/producer-consumer-with-qualifications) – jppalencar May 04 '11 at 04:43
3

So counting filesizes in parallel should be so easy?

It's not :)

I tried to solve this issue better. I realized that i'm doing blocking I/O operations so pmap doesn't do the job. I was thinking maybe giving chunks of directories (branches) to agents to process it independently would make sense. Looks it does :) Well I haven't benchmarked it yet.

It works, but, there might be some problems with symbolic links on UNIX-like systems.

(def user-dir (clojure.java.io/file "/home/janko/projects/"))
(def root-dir (clojure.java.io/file "/"))
(def run? (atom true))
(def *max-queue-length* 1024)
(def *max-wait-time* 1000)    ;; wait max 1 second then process anything left
(def *chunk-size* 64)
(def queue (java.util.concurrent.LinkedBlockingQueue. *max-queue-length* ))
(def agents (atom []))
(def size-total (atom 0))
(def a (agent []))

(defn branch-producer [node]
  (if @run?
    (doseq [f node]
      (when (.isDirectory f)
    (do (.put queue f)
        (branch-producer (.listFiles f)))))))

(defn producer [node]
  (future
    (branch-producer node)))

(defn node-consumer [node]
  (if (.isFile node)
    (.length node)
    0))

(defn chunk-length []
  (min (.size queue) *chunk-size*))

(defn compute-sizes [a]
  (doseq [i (map (fn [f] (.listFiles f)) a)]
    (swap! size-total #(+ % (apply + (map node-consumer i))))))

(defn consumer []
  (future
    (while @run?
      (when-let [size (if (zero? (chunk-length))
            false
            (chunk-length))] ;appropriate size of work
      (binding [a (agent [])]                    
        (dotimes [_ size]         ;give us all directories to process
          (when-let [item (.poll queue)]
            (set! a (agent (conj @a item)))))
        (swap! agents #(conj % a))
        (send-off a compute-sizes))
      (Thread/sleep *max-wait-time*)))))

You can start it by typing

    (producer (list user-dir))
    (consumer)

For result type

    @size-total

You can stop it by (there are running futures - correct me if I'm wrong)

    (swap! run? not)

If you find any errors/mistakes, you're welcome to share your ideas!

jppalencar
  • 178
  • 1
  • 10
  • Fixed to work with clojure 1.6.0, leiningen 2.5.0 and runnable binary. https://github.com/janpalencar/clojure-parallel-dirsize – jppalencar Aug 22 '15 at 16:57