
I'm trying to have a function write a database SQL dump to a text file from a select statement. The volume returned can be very large, and I'm interested in doing this as fast as possible.

With a large result set I also need to log, every x-interval, the total number of rows written and how many rows per second have been written since the last x-interval. I have a (map ) that is actually doing the write inside a (with-open ), so I believe the side effect of logging rows completed should happen there. (See comments in code.)

My questions are:

  1. How do I write "rows per second" during the interval and "total rows so far"?
  2. Is there anything additional I should keep in mind while writing large JDBC result sets to a file (or named pipe, bulk loader, etc.)?
  3. Does the (doall ) around the (map ) function fetch all results, making it non-lazy and potentially memory intensive?
  4. Would fixed width be possible as an option? I believe that would be faster for a named pipe to a bulk loader. The trade-off would be disk I/O in place of CPU utilization for downstream parsing. However, this might require introspection on the returned result set (with .getMetaData?)

    (ns metadata.db.table-dump 
      [:use 
       [clojure.pprint]
       [metadata.db.connections] 
       [metadata.db.metadata]
       [clojure.string :only (join)]
       [taoensso.timbre :only (debug info warn error set-config!)]
       ]
      [:require
       [clojure.java.io       :as io ] 
       [clojure.java.jdbc     :as j  ]     
       [clojure.java.jdbc.sql :as sql]     
       ]
      )
    
    (set-config! [:appenders :spit :enabled?] true)
    (set-config! [:shared-appender-config :spit-filename] "log.log")
    
    (let [
          field-delim    "\t" 
          row-delim      "\n" 
          report-seconds 10 
          sql            "select * from comcast_lineup "
          joiner         (fn [v] (str (join field-delim v ) row-delim ) )
          results        (rest (j/query local-postgres  [sql ] :as-arrays? true :row-fn joiner ))  
          ]
      (with-open [wrtr (io/writer "test.txt")]
          (doall 
              (map #(.write wrtr %) 
                  ; Somehow in here i want to log with (info ) rows written so
                  ; far, and "rows per second" every 10 seconds.  
                  results )) 
        ) (info "Completed write") )
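To illustrate the introspection mentioned in question 4: the column layout needed for fixed-width output can be read from the result-set metadata. A minimal sketch, assuming plain JDBC interop; `column-widths` is a hypothetical helper, not part of any library:

```clojure
;; Sketch for question 4: .getMetaData introspection for fixed-width output.
;; `getColumnDisplaySize` is the driver-reported maximum width per column.
(defn column-widths
  "Return one {:name :width} map per column from JDBC result-set metadata."
  [^java.sql.ResultSetMetaData md]
  (vec (for [i (range 1 (inc (.getColumnCount md)))]
         {:name  (.getColumnName md i)
          :width (.getColumnDisplaySize md i)})))
```

Given a `java.sql.ResultSet` `rs`, `(column-widths (.getMetaData rs))` yields the widths from which a fixed-width format string could be built.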
    
joefromct

2 Answers


You may get some use out of my answer to Idiomatic clojure for progress reporting?

To your situation specifically

1) You could pair your map with an index as a second argument to the anonymous function; inside the function you are mapping, the index tells you which row you are writing, and it can be used to update an atom.

user> (def stats (atom {}))
#'user/stats
user> (let [start-time (. (java.util.Date.) getTime)] 
         (dorun (map (fn [line index] 
                       (println line) ; write to log file here
                       (reset! stats [{:lines index 
                                       :start start-time 
                                       :end (. (java.util.Date.) getTime)}])) 
                     ["line1" "line2" "line3"] 
                     (rest (range)))))
line1
line2
line3
nil
user> @stats  
[{:lines 3, :start 1383183600216, :end 1383183600217}] 
user>  

The contents of stats can then be printed/logged every few seconds to update the UI
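A periodic reader of that atom might look like this. A sketch only: `progress-line` and `start-reporter` are hypothetical names, and `println` stands in for the real logging call:

```clojure
;; Sketch: derive a progress line from one stats entry, and print it on a
;; fixed interval from a background thread.
(defn progress-line
  "Format total rows and average rows/s from one stats entry."
  [{:keys [lines start end]}]
  (let [elapsed-s (max 0.001 (/ (- end start) 1000.0))]
    (format "%d rows, %.1f rows/s" lines (/ lines elapsed-s))))

(defn start-reporter
  "Report progress every report-seconds; stop with future-cancel."
  [stats report-seconds]
  (future
    (while true
      (Thread/sleep (* 1000 report-seconds))
      (when-let [entry (first @stats)]
        (println (progress-line entry))))))
```

`(start-reporter stats 10)` returns a future; cancel it with `future-cancel` when the write completes.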

3) You most certainly want to use dorun instead of doall because, as you suspect, doall will run out of memory on a large enough data set. dorun drops each result as it is written, so you can run it over arbitrarily large (even infinite) input if you are willing to wait long enough.
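The difference is easy to see at the REPL: both force the lazy sequence, but doall returns (and therefore retains) every realized element, while dorun returns nil:

```clojure
;; Both force the lazy seq for side effects; only doall keeps the elements.
(def hits (atom 0))
(def kept    (doall (map (fn [x] (swap! hits inc) x) (range 3))))
(def dropped (dorun (map (fn [x] (swap! hits inc) x) (range 3))))
;; kept => (0 1 2), dropped => nil, @hits => 6
```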

Arthur Ulfeldt
  • you could use [map-indexed](http://richhickey.github.io/clojure/clojure.core-api.html#clojure.core/map-indexed) instead of `map` here. – xsc Oct 31 '13 at 09:01
  • That'll be helpful when i get to progress bars... for the time being i'm just pushing a ratio/float out to a log file. Thanks. – joefromct Oct 31 '13 at 12:22
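As the comment notes, map-indexed folds the index in directly; a quick illustration (the index is zero-based, so inc restores the 1-based count that map plus (rest (range)) produced above):

```clojure
;; map-indexed supplies a zero-based index as the first argument,
;; replacing the map + (rest (range)) pairing from the answer above.
(def numbered
  (map-indexed (fn [idx line] [(inc idx) line])
               ["line1" "line2" "line3"]))
;; numbered => ([1 "line1"] [2 "line2"] [3 "line3"])
```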

Couple general tips:

  • At the JDBC level you may need to use setFetchSize to avoid loading the entire result set into RAM before it even gets to Clojure. See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
  • Make sure clojure.java.jdbc is actually returning a lazy seq (it probably is); if not, consider resultset-seq.
  • doall will indeed force the whole thing into RAM; try doseq instead.
  • Consider using an atom to keep count of rows written as you go; you can use it to report rows-so-far, rows per second, etc.
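The fetch-size point can be sketched with plain JDBC interop. Assumptions: `stream-query` is a hypothetical helper, `conn` is an open java.sql.Connection, and (for PostgreSQL in particular) the driver only streams rather than buffering everything when auto-commit is off:

```clojure
;; Sketch: stream a large result set with a bounded fetch size.
;; PostgreSQL honors setFetchSize only when auto-commit is disabled.
(defn stream-query
  "Run sql on conn, calling (handle-row rs) per row; returns the row count."
  [^java.sql.Connection conn sql handle-row]
  (.setAutoCommit conn false)
  (with-open [stmt (.createStatement conn)]
    (.setFetchSize stmt 1000)
    (with-open [rs (.executeQuery stmt sql)]
      (loop [n 0]
        (if (.next rs)
          (do (handle-row rs) (recur (inc n)))
          n)))))
```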

Sketch:

(let [ .. your stuff ..
      start (System/currentTimeMillis)
      row-count (atom 0)]
  (with-open [^java.io.Writer wrtr (io/writer "test.txt")]
    (doseq [row results]
      (.write wrtr row)
      (swap! row-count inc)
      (when (zero? (mod @row-count 10000))
        ;; elapsed wall-clock seconds since the write began
        (let [elapsed (/ (- (System/currentTimeMillis) start) 1000.0)]
          (println (format "written %d rows" @row-count))
          (println (format "rows/s %.2f" (/ @row-count (max 0.001 elapsed)))))))))
overthink