
I would like to speed up loading data into PostgreSQL. I started using pgloader (https://github.com/dimitri/pgloader) and wanted to take advantage of parallel loading. I tinkered with various parameters, but I couldn't get more than two cores active on my machine (which has 32 of them). I found the documentation at https://github.com/dimitri/pgloader/blob/master/pgloader.1.md and tried setting the batch options described there. Currently, I have these settings:

 LOAD CSV
      FROM '/home/data1_1.csv'
      --FROM 'data/data.csv'
      INTO postgresql://:postgres@localhost:5432/test?test

      WITH truncate,
           skip header = 0,
           fields optionally enclosed by '"',
           fields escaped by double-quote,
           fields terminated by ',',
           batch rows = 100,
           batch size = 1MB,
           batch concurrency = 64

       SET client_encoding to 'utf-8',
           work_mem to '10000MB',
           maintenance_work_mem to '20000 MB';
ady
1 Answer


I also got to this question, and it seems pgloader does not yet support parallel loading via the batch options you mention. It is a bit confusing, but the official documentation explains that these settings are about memory management, not parallelism:

batch concurrency
Takes a numeric value as argument, defaults to 10. That's the number of batches that pgloader is allowed to build in memory, even when only a single batch at a time might be sent to PostgreSQL.

Supporting more than a single batch being sent at a time is on the TODO list of pgloader, but is not implemented yet. This option is about controlling the memory needs of pgloader as a trade-off to the performance characteristics, and not about parallel activity of pgloader.
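Until that TODO item lands, one possible workaround (my suggestion, not something from the pgloader documentation) is to split the input file into chunks and run one independent pgloader process per chunk against the same table, since PostgreSQL handles concurrent COPY streams fine. Here is a minimal sketch in Python; the chunk paths, chunk count, and the exact --with options are placeholders you would adapt to your data, and the chunks are assumed to have been created beforehand, e.g. with GNU split:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical chunks of /home/data1_1.csv, e.g. created beforehand with
    # `split -n l/8 /home/data1_1.csv /tmp/chunk_` (GNU split, keeps lines whole).
    CHUNKS = [f"/tmp/chunk_a{c}" for c in "abcdefgh"]

    # Connection string copied from the question; `?test` names the target table.
    TARGET = "postgresql://:postgres@localhost:5432/test?test"

    def load(chunk):
        # One independent pgloader process per chunk. pgloader accepts a source
        # file and a target URI directly on the command line; CSV options are
        # passed through repeated --with flags (add --field definitions as
        # needed to map your columns).
        subprocess.run(
            ["pgloader",
             "--type", "csv",
             "--with", "fields terminated by ','",
             "--with", "fields optionally enclosed by '\"'",
             chunk, TARGET],
            check=True)

    # Parallelism comes from the OS processes, not from pgloader's batch
    # settings. Threads are enough here because each thread just waits on
    # its subprocess.
    with ThreadPoolExecutor(max_workers=len(CHUNKS)) as pool:
        list(pool.map(load, CHUNKS))

One caveat: with this approach you would truncate the table once up front rather than passing truncate to every process, or each loader would wipe out the others' rows.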

Alexander S