3

I need to write a function that splits records into separate files based on the value of a field. E.g. given the input:

[["Paul" "Smith" 35]
 ["Jason" "Nielsen" 39]
 ["Charles" "Brown" 22]]

We end up with a file "Paul" containing the line "Paul Smith 35", a file "Jason" containing "Jason Nielsen 39", and so on.

I don't know the names in advance, so I need to keep references to the writers so that I can close them at the end.

The best I could come up with was using a ref to keep the writers, like this:

(defn write-split [records]
  (let [out-dir    (io/file "/tmp/test/")
        open-files (ref {})]
    (try
      (.mkdirs out-dir)
      (dorun
        (for [[fst lst age :as rec] records]
          (binding [*out* (or
                            (@open-files fst)
                            (dosync
                              (alter open-files assoc fst (io/writer (str out-dir "/" fst)))
                              (@open-files fst)))]
            (println (apply str (interpose " " rec))))))
      (finally (dorun (map #(.close %) (vals @open-files)))))))

This works, but feels horrible and, more importantly, runs out of heap, even though I only have five output files, all of which are opened at the very beginning. It seems like something is being retained somehow...

Can anyone think of a more functional and Clojure-like solution?

EDIT: The input is big - potentially gigabytes of data - hence the importance of memory efficiency and the reluctance to close the files after every write.

George
  • The record data is so large you can't group it in memory? `(group-by first records)` and then just opening and closing a file for each new key in the returned map (see the sketch after these comments). – ponzao Jan 05 '12 at 12:50
  • Yes, it's big - gigabytes of data coming in through a lazy sequence. Ideally infinite. – George Jan 05 '12 at 13:34
  • Is there a requirement to keep the files open, or is the total number of files that need to be opened finite? Most OSes only allow a certain number of open files in total. – deterb Jan 05 '12 at 13:51
  • Finite, small number of files - up to 20. Strictly speaking they don't need to be open, but closing the file after every write (as with the with-open macro) is not an option. – George Jan 05 '12 at 15:38
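
(For reference, here is a minimal sketch of the group-by approach from ponzao's comment above. It assumes the whole input fits in memory, which the edit to the question rules out; write-grouped is a hypothetical name.)

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Group records by first name, then write each group out in one pass,
;; so every file is opened and closed exactly once.
(defn write-grouped [records]
  (doseq [[filename recs] (group-by first records)]
    (with-open [w (io/writer filename)]
      (doseq [rec recs]
        (.write w (str (str/join " " rec) "\n"))))))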

4 Answers

3
(use '[clojure.string :only (join)])
(use '[clojure.java.io :only (writer)]) ; needed for writer below

(defn write-records! [records]
  (let [writers (atom {})]
    (try
      (doseq [[filename :as record] records]
        ;; reuse the cached writer for this name, or open and cache a new one
        (let [w (or (get @writers filename)
                    (get (swap! writers assoc filename (writer filename))
                         filename))]
          (.write w (str (join " " record) "\n"))))
      (finally (dorun (map #(.close (second %)) @writers))
               (reset! writers {})))))
ponzao
2

I wonder if your problem of running out of heap is somehow related to the use of binding within the for. It looks like your code requires a new binding for every record, and maybe the old ones are being retained. (I may be completely wrong about this; Clojure binding is a dark art to me.)

You might consider having your main record sorting code put data onto queues (maybe one per logical file). Then have some "workers" (maybe writer functions closing over the appropriate out binding) pull from the queues using something from the java Executor libraries. (This question: "Sleeping a thread inside an ExecutorService (Java/Clojure)" might provide some hints.)

You would still have to handle gracefully shutting down the workers and closing the files somehow. (This other question "Clojure agents consuming from a queue" might suggest an approach.)
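
Here is a minimal sketch of that queue-per-file idea, using a plain java.util.concurrent.LinkedBlockingQueue per file and futures rather than an ExecutorService. write-split-queued, start-worker, and the ::done poison pill are names invented here for illustration; shutdown works by poisoning every queue and then waiting on the workers.

(import '[java.util.concurrent LinkedBlockingQueue])
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(def ^:private poison ::done) ; sentinel telling a worker to stop

(defn- start-worker
  "Starts a future that drains queue into the file filename,
  closing the writer when it sees the poison pill."
  [filename ^LinkedBlockingQueue queue]
  (future
    (with-open [w (io/writer filename)]
      (loop []
        (let [record (.take queue)]
          (when-not (= record poison)
            (.write w (str (str/join " " record) "\n"))
            (recur)))))))

(defn write-split-queued [records]
  (let [queues  (atom {})  ; filename -> queue
        workers (atom [])] ; futures to wait on
    (doseq [[fst :as record] records]
      (let [q (or (@queues fst)
                  (let [q (LinkedBlockingQueue.)]
                    (swap! queues assoc fst q)
                    (swap! workers conj (start-worker fst q))
                    q))]
        (.put q record)))
    ;; shut down: poison every queue, then wait for each worker to finish
    (doseq [q (vals @queues)] (.put q poison))
    (doseq [w @workers] @w)))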

Good luck! Having to interface the abstraction of sequences over infinite data with the inevitable imperative statefulness of the file system ain't trivial (but hopefully still simpler in Clojure than in other languages).

Alex Stoddard
0

with-open can handle the closing of files for you.

(ns sandbox.core
  (:require [clojure.java.io :as io]))

(def data [["Paul" "Smith" 35]
           ["Jason" "Nielsen" 39]
           ["Charles" "Brown" 22]])

(doseq [record data]
  (with-open [w (io/writer (first record))]
    (binding [*out* w]
      (apply println record))))

Based on your edit, you don't want to open and close files all the time for performance reasons. One approach is to keep the writers in a cache. The following version uses core.memoize to memoize the get-writer function; after all records have been written, the cached writers are closed.

(require '[clojure.core.memoize :as memoize]) ; core.memoize is a separate dependency

(defn write-data [data]
  (let [get-writer (memoize/memo #(io/writer % :append true))]
    (try
      (doseq [record data]
        (let [w (get-writer (first record))]
          (binding [*out* w]
            (apply println record))))
      (finally
       ;; snapshot returns the memoization cache as a map of args -> writer
       (dorun (map #(.close %)
                   (vals (memoize/snapshot get-writer))))))))
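
With the question's sample data this would be called as `(write-data data)`; because the writers are opened with `:append true`, re-running it appends to the existing files instead of truncating them.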
Jonas
0
(use '[clojure.string :only [join]])

(def vecs [["Paul" "Smith" 35] ["Jason" "Nielsen" 39] ["Charles" "Brown" 22]])

(defn write-files [v]
  (doseq [i v]
    (spit (i 0)        ; (i 0) gets the element at index 0 of the vector
          (join " " i))))

(write-files vecs)

This should work.

patz