Moving 5,000,000 rows to another Postgresql DBs by Clojure & JDBC

Question

I am trying to move 5,000,000 rows from one Postgre DB to another one. Both connections are in Hikari CP connection pool.

I went through a lot of documentation and posts. It left me with the code bellow. But it is not really usable:

(jdbc/with-db-connection [tx {:datasource source-db}]
  (jdbc/query tx
      [(jdbc/prepare-statement (jdbc/get-connection tx)
                                answer-sql
                                {:fetch-size 100000})]
                  {:result-set-fn (fn [result-set]
                                    (jdbc/insert-multi!
                                     {:datasource target-db}
                                     :migrated_answers
                                     result-set))}))

I've tried a lot of little different forms of this. jdbc/with-db-transaction or any other I can think of didn't help much.

A lot of tutorials and posts mention only the way how to process the result as a whole. It is absolutely ok with small tables that get in RAM but and it seems fast. But this is not the case.
So when I properly use :fetch-size and my RAM doesn't explode (hocus pocus) than the transfer IS very slow with both connections switching between 'active' and 'idle in transaction' states on DB sides. Ive never waited for so long to find any of the data actually transferred!

When I create this simple batch in Talend Open Studio (ETL tool generating Java code) it transfers all the data in 5 minutes. And the cursor-size is "also" set to 100000 there. I think that Clojure's clean code should be faster.

The fastest result that I've got was with this code below. I think it is because the :as-array parameter. If I don't use :max-rows parameter memory explodes because it is not processed lazily, so I can't use this for the whole transfet. Why? I don't understand the rules here.

(jdbc/with-db-transaction [tx {:datasource source-db}]
  (jdbc/query tx
              [(jdbc/prepare-statement (:connection tx)
                                        answer-sql
                                       {:result-type :forward-only
                                        :concurrency :read-only
                                        :fetch-size 2000
                                        :max-size 250000})]
              {:as-arrays? true
               :result-set-fn (fn [result-set]
                                (let [keys (first result-set)
                                      values (rest result-set)]
                                  (jdbc/insert-multi! 
                                     {:datasource dct-db}
                                      :dim_answers
                                       keys values)))}))

I will appreciate any help or info that I am clearly missing.

This question may help: https://stackoverflow.com/questions/39765943/clojure-java-jdbc-lazy-query/39782071#39782071 — Alan Thompson, Feb 07 '18 at 04:43
And here: https://stackoverflow.com/questions/19728538/clojure-java-jdbc-query-large-resultset-lazily — Alan Thompson, Feb 07 '18 at 04:44
In theory, I think the optimal approach is to connect a pair of `pgjdbc` [`CopyManager`](https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html)s via piped [input](https://docs.oracle.com/javase/8/docs/api/java/io/PipedInputStream.html) and [output](https://docs.oracle.com/javase/8/docs/api/java/io/PipedOutputStream.html) streams, with each running in its own thread. No idea what this looks like in Clojure, though... — Nick Barnes, Feb 07 '18 at 11:22
Have you seen Talend's code that is generated by this operation? It doesnt seem very optimized to me. But i am open to any suggestion to clarify this statement :) — Akiz, Feb 07 '18 at 13:52
The subjective cleanliness of code in general has nothing to do with its performance (the reverse isn't true either!). — Mark Rotteveel, Feb 07 '18 at 15:52

score 7 · Answer 1 · answered Feb 07 '18 at 06:08

I think the key observation here is that while your query is lazily streaming results from one DB, your insert is just one giant write to the other DB. With regard to memory usage, I don't think it makes much difference whether you've streamed the query results lazily or not if you're collecting all those results (in-memory) for a single write operation at the end.

One way to balance memory usage with throughput is to batch the writes:

(db/with-db-transaction [tx {:datasource source-db}]
  (db/query tx
    [(db/prepare-statement (:connection tx)
                           answer-sql
                           {:result-type :forward-only
                            :concurrency :read-only
                            :fetch-size 2000})]
    {:as-arrays? true
     :result-set-fn (fn [result-set]
                      (let [keys (first result-set)
                            values (rest result-set)]
                        (doseq [batch (partition-all 2000 values)]
                          (db/insert-multi! {:datasource dct-db}
                                            :dim_answers
                                            keys
                                            batch))))}))

The difference is this uses partition-all to insert values in batches (the same size as :fetch-size but I'm sure this could be tuned). Compare the performance/memory usage of this approach with the other by setting JVM max heap size to something like -Xmx1g. I couldn't get the non-batched version to complete using this heap size.

I was able to migrate 6 million small-ish rows between local PostgreSQL DBs on my laptop in ~1 minute and with java using <400MB memory. I also used HikariCP.

If you do insert in batches, you may want to consider wrapping all the inserts in single transaction if it suits your use case. I left the additional transaction out here for brevity.

If i dont use :max-size parameter memory explodes

I can't find any reference (besides a spec) to this option in the latest clojure.java.jdbc, and it didn't affect my testing. I do see a :max-rows but surely you don't want that.

I think it is because the :as-array parameter.

I would expect this to be beneficial to memory usage; the row vectors should be more space-efficient than row maps.

Thank you! Your solution works very well! I still have to try any optimization - and if I (or anybody else) will find some tricks, i will publish them here. And I am sorry! I meant :max-rows, not :max-size before. I just limited the output so i could get the results and watch the speed. Amazing, thank you! — Akiz, Feb 07 '18 at 11:02
Ive got one question still... How to preserve transaction and not commit every 2000 values but just in the end of transaction? Auto-commit off in connection parameters doesn´t help here. — Akiz, Feb 23 '18 at 07:38
@Akiz try wrapping your insert `doseq` with `(db/with-db-transaction [insert-tx {:datasource dct-db}] ...)`, then using `(:connection insert-tx)` for the `insert-multi!` connection. — Taylor Wood, Feb 23 '18 at 11:57
This will create one connection for every 2000 rows so i had to connect to the database as a SU and kill my connections (because cider on windows doesnt close connection properly), so this is not solution. But thank you — Akiz, Mar 01 '18 at 07:48
Ive found the solution i think. Wrapping transaction (that is inserting the data, not reading like before) with another 'with-db-transaction [read-tx {:datasource dag-db}]' that is reading the data. I dont really understand why this is the only way how i manage to get batch processing without commit but i am happy. This is also faster than the original solution as it commits in the end. — Akiz, Mar 01 '18 at 18:20

Akiz · Accepted Answer · 2018-03-01T19:08:57.110

This solution works best for me and it also seems faster than Taylor's solution. But huge thank for helping me.

It doesnt commit until the transaction is done. I have to experience any problems yet to see if I wont have to pimp it but i am happy for now. I've tried to replace first transaction with with-db-connection but it makes the records load straight into RAM.

(defn data->transfer2 [sql table]
     (jdbc/with-db-transaction [read-tx {:datasource dag-db}]
     (jdbc/with-db-transaction [tx {:datasource dct-db}]
        (jdbc/query read-tx
                  [(jdbc/prepare-statement (:connection read-tx)
                                           answer-sql
                                           {:result-type :forward-only
                                            :concurrency :read-only
                                            :fetch-size 100000})]
                  {:as-arrays? true
                   :result-set-fn (fn [result-set]
                                    (let [keys (first result-set)
                                          values (rest result-set)]
                                      (doseq [btch (partition-all 100000 values)]
                                        (jdbc/insert-multi! tx
                                                            :dim_answers
                                                             keys
                                                             btch))))})))

Moving 5,000,000 rows to another Postgresql DBs by Clojure & JDBC

2 Answers2