What's the fastest way to do a bulk insert into Postgres?

Question

I need to programmatically insert tens of millions of records into a Postgres database. Presently, I'm executing thousands of insert statements in a single query.

Is there a better way to do this, some bulk insert statement I do not know about?

score 292 · Accepted Answer · edited Jun 21 '20 at 07:34

PostgreSQL has a guide on how to best populate a database initially, and they suggest using the COPY command for bulk loading rows. The guide has some other good tips on how to speed up the process, like removing indexes and foreign keys before loading the data (and adding them back afterwards).
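
As a rough illustration of the COPY route from application code, here is a minimal Python sketch. It assumes a psycopg2-style cursor (its copy_expert(sql, file) method) and a hypothetical table and column list supplied by the caller; none of this comes from the guide itself. Rows are serialized to an in-memory CSV buffer and handed to COPY ... FROM STDIN in one call.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV file object."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf


def copy_rows(cur, table, columns, rows):
    """Load rows with COPY ... FROM STDIN.

    `cur` is assumed to be a psycopg2 cursor; `table` and `columns` are
    trusted identifiers here (no quoting is attempted in this sketch).
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cur.copy_expert(sql, rows_to_csv_buffer(rows))
```

Note that csv.writer emits None as an empty field, which COPY's csv format reads as NULL by default. For tens of millions of rows you would probably feed the buffer in chunks rather than build it all in memory.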

edited Jun 21 '20 at 07:34

Kos

answered Apr 17 '09 at 03:57

Dan Lew

I wrote a bit more detail to elaborate in http://stackoverflow.com/questions/12206600/how-to-speed-up-insertion-performance-in-postgresql too. – Craig Ringer Feb 04 '14 at 01:03
@CraigRinger Wow, "a bit more detail" is the best understatement I have seen all week ;) – culix Mar 07 '14 at 07:07
Try Install-Package NpgsqlBulkCopy – Elyor Aug 29 '14 at 10:49
Since indexes are also used for the physical layout of the DB records, I'm not sure that removing indexes is a good idea in any database. – Farjad Sep 02 '14 at 09:45
But with what you recommend, nothing is kept in memory! And if the batch size is small, that class works very badly :( I tried the Npgsql CopyIn class, because it works like a CSV-formatted mapping in a PG query statement. Can you try it on a big table? – Elyor Sep 09 '14 at 02:03

score 160 · Answer 2 · answered Jan 27 '15 at 10:18

There is an alternative to using COPY, which is the multirow values syntax that Postgres supports. From the documentation:

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

The above code inserts two rows, but you can extend it arbitrarily. If the values are sent as bind parameters of a prepared statement, the cap is the number of parameters per statement: the wire protocol stores that count in a 16-bit field, so a single statement can carry at most 65535 of them. Sometimes one cannot use COPY, and this is a worthy replacement for those situations.
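
To make the batching concrete, here is a small Python sketch that splits a list of rows into multirow INSERT statements, each staying under a parameter cap. It assumes a DB-API driver that uses %s placeholders; the function name, the default cap constant, and the table and column names in the usage are illustrative, not part of the answer.

```python
MAX_BIND_PARAMS = 65535  # parameter count is a 16-bit field in the wire protocol


def multirow_inserts(table, columns, rows, max_params=MAX_BIND_PARAMS):
    """Yield (sql, params) pairs, each a multirow INSERT under the parameter cap.

    `table` and `columns` are trusted identifiers in this sketch; values are
    passed as bind parameters using DB-API style %s placeholders.
    """
    rows_per_stmt = max_params // len(columns)
    if rows_per_stmt == 0:
        raise ValueError("too many columns for a single statement")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    for start in range(0, len(rows), rows_per_stmt):
        chunk = rows[start:start + rows_per_stmt]
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table, ", ".join(columns), ", ".join([row_placeholder] * len(chunk))
        )
        # Flatten the chunk so parameters line up with the placeholders.
        params = [value for row in chunk for value in row]
        yield sql, params
```

Each yielded pair would then be passed to cursor.execute(sql, params). Whether the cap actually bites depends on the driver: it only applies when parameters are bound server-side.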

answered Jan 27 '15 at 10:18

Ben Harper

Do you know how the performance of this method compares to COPY? – Grant Humphries Dec 17 '15 at 19:50
If you run into a permissions problem, before trying this, use COPY ... FROM STDIN – Andrew Scott Evans Jun 12 '17 at 23:47
If you're using row-level security, this is the best you can do. "COPY FROM is not supported for tables with row-level security" as of version 12. – Eloff Nov 23 '19 at 17:50
COPY is a lot faster than extended INSERT – hipertracker Feb 01 '20 at 23:46
Most important in this kind of procedure (raw data **ingestion**) is the transformation, **expressed in a SQL standard** (no use of exotic tools). See https://stackoverflow.com/a/62493516/287948 – Peter Krauss Aug 26 '20 at 13:51
The performance here for me was perfect. 370K rows in 3.291 seconds. – Sam Autrey Jan 30 '22 at 04:06
In the PostgreSQL wire protocol the number of parameters is a 2-byte field (a C short), which limits a statement to 65535 bind parameters. – Sergey Kuznetsov Jul 19 '23 at 15:50

score 29 · Answer 3 · edited Apr 17 '09 at 04:12

One way to speed things up is to explicitly perform multiple inserts or COPYs within a single transaction (say, 1000 at a time). Postgres's default behavior is to commit after each statement, so batching the commits avoids some overhead. As the guide in Daniel's answer says, you may have to disable autocommit for this to work. Also note the comment at the bottom of the guide, which suggests that increasing wal_buffers to 16 MB may help.
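
A minimal Python sketch of the idea, assuming a DB-API connection with autocommit off (the psycopg2 default) and a single-row parametrized INSERT supplied by the caller. The helper names and the batch size of 1000 are illustrative only.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, sql, rows, batch_size=1000):
    """Run `sql` for every row, committing once per batch instead of per row.

    `conn` is assumed to be a DB-API connection with autocommit off;
    `sql` is a single-row parametrized INSERT.
    """
    cur = conn.cursor()
    for batch in batched(rows, batch_size):
        cur.executemany(sql, batch)
        conn.commit()  # one commit per batch
```

The right batch size is workload-dependent; 1000 is just the figure mentioned above and is worth measuring against larger values.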

edited Apr 17 '09 at 04:12

answered Apr 17 '09 at 04:06

Dana the Sane

It is worth mentioning that the limit for how many inserts/copies you can add to the same transaction is likely much higher than anything you'll attempt. You could add millions and millions of rows within the same transaction and not run into problems. – Sumeet Jain Apr 27 '16 at 00:25
@SumeetJain Yes, I'm just remarking on the speed 'sweet spot' in terms of the number of copies/inserts per transaction. – Dana the Sane Apr 27 '16 at 18:43
Will this lock the table while the transaction is running? – Lambda Fairy Sep 12 '18 at 23:23

ndpu · Answer 4 · 2015-06-26T10:43:47.460

The UNNEST function with arrays can be used along with the multirow VALUES syntax. I think this method is slower than COPY, but it is useful to me when working with psycopg and Python (a Python list passed to cursor.execute becomes a PG ARRAY):

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
VALUES (
    UNNEST(ARRAY[1, 2, 3]), 
    UNNEST(ARRAY[100, 200, 300]), 
    UNNEST(ARRAY['a', 'b', 'c'])
);

Without VALUES, using a subselect with an additional existence check:

INSERT INTO tablename (fieldname1, fieldname2, fieldname3)
SELECT * FROM (
    -- aliases are needed so the outer query can reference the columns
    SELECT UNNEST(ARRAY[1, 2, 3]) AS fieldname1,
           UNNEST(ARRAY[100, 200, 300]) AS fieldname2,
           UNNEST(ARRAY['a', 'b', 'c']) AS fieldname3
) AS temptable
WHERE NOT EXISTS (
    SELECT 1 FROM tablename tt
    WHERE tt.fieldname1=temptable.fieldname1
);

The same syntax works for bulk updates:

UPDATE tablename
SET fieldname1=temptable.data
FROM (
    SELECT UNNEST(ARRAY[1,2]) AS id,
           UNNEST(ARRAY['a', 'b']) AS data
) AS temptable
WHERE tablename.id=temptable.id;
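
From Python, the array-passing approach might look like the sketch below. It assumes a psycopg2 cursor (which adapts Python lists to PostgreSQL arrays), PostgreSQL 9.4 or later for the multi-argument form of unnest in FROM, and the int/int/text column types implied by the examples above. The function name is illustrative.

```python
def insert_via_unnest(cur, rows):
    """Insert (int, int, text) rows with one statement and three array parameters.

    `cur` is assumed to be a psycopg2 cursor; tablename and its columns
    follow the SQL examples in this answer.
    """
    if not rows:
        return
    # Transpose the row tuples into one list per column.
    col1, col2, col3 = (list(col) for col in zip(*rows))
    cur.execute(
        "INSERT INTO tablename (fieldname1, fieldname2, fieldname3) "
        "SELECT * FROM unnest(%s::int[], %s::int[], %s::text[])",
        (col1, col2, col3),
    )
```

The statement text stays the same regardless of how many rows are sent, since only three parameters are ever bound.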

score 17 · Answer 5 · edited Mar 16 '22 at 23:57

((this is a WIKI you can edit and enhance the answer!))

The external file is the best and typical bulk-data

The term "bulk data" is related to "a lot of data", so it is natural to use original raw data, with no need to transform it into SQL. Typical raw data files for "bulk insert" are CSV and JSON formats.

Bulk insert with some transformation

In ETL applications and ingestion processes, we need to change the data before inserting it. A temporary table consumes (a lot of) disk space, and it is not the fastest way to do it. A PostgreSQL foreign-data wrapper (FDW) is the best choice.

CSV example. Suppose a table tablename (x, y, z) in SQL and a CSV file like

fieldname1,fieldname2,fieldname3
etc,etc,etc
... million lines ...

You can use the classic SQL COPY to load the original data as-is into tmp_tablename, then insert the filtered data into tablename. But to avoid disk consumption, it is best to ingest it directly with

INSERT INTO tablename (x, y, z)
  SELECT f1(fieldname1), f2(fieldname2), f3(fieldname3) -- the transforms 
  FROM tmp_tablename_fdw
  -- WHERE conditions
;

You need to prepare the database for FDW, and instead of a static tmp_tablename_fdw you can use a function that generates it:

CREATE EXTENSION file_fdw;
CREATE SERVER import FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE tmp_tablename_fdw(
  ...
) SERVER import OPTIONS ( filename '/tmp/pg_io/file.csv', format 'csv', header 'true'); -- header 'true' skips the CSV header row

JSON example. A set of two files, myRawData1.json and Ranger_Policies2.json can be ingested by:

INSERT INTO tablename (fname, metadata, content)
 SELECT fname, fmeta, j  -- do any data transformation here
 FROM jsonb_read_files('myRawData%.json')
 -- WHERE any_condition_here
;

where the function jsonb_read_files() reads all files of a folder, defined by a mask:

CREATE or replace FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io/'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
     SELECT (row_number() OVER ())::int id, 
           f AS fname,
           p_fpath ||'/'|| f AS f
     FROM pg_ls_dir(p_fpath) t(f)
     WHERE f LIKE p_flike
  ) SELECT id, fname,
         to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
         pg_read_file(f)::jsonb
    FROM t
$f$  LANGUAGE SQL IMMUTABLE;

Lack of gzip streaming

The most frequent method for "file ingestion" (mainly in Big Data) is to preserve the original file in gzip format and transfer it with a streaming algorithm, that is, anything that runs fast and without disk consumption in Unix pipes:

 gunzip -c remote_or_local_file.csv.gz | convert_to_sql | psql

So the ideal (future) solution is a server option for the .csv.gz format.

Note after @CharlieClark's comment: currently (2022) there is nothing to be done; the best alternative seems to be pgloader reading from STDIN:

  gunzip -c file.csv.gz | pgloader --type csv ... - pgsql:///target?foo
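
If a client program is acceptable instead of a shell pipe, a similar streaming effect can be sketched in Python. This is a swapped-in alternative, not the pgloader approach: it assumes a psycopg2 cursor and its copy_expert method, a CSV file with a header row, and a target table whose columns match the file. The function name is illustrative.

```python
import gzip


def copy_gzipped_csv(cur, table, path):
    """Stream a gzip-compressed CSV with a header row into `table`.

    `cur` is assumed to be a psycopg2 cursor; `table` is a trusted
    identifier. The file is decompressed on the fly as COPY reads it,
    so no uncompressed copy is written to disk.
    """
    sql = "COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)".format(table)
    with gzip.open(path, "rb") as stream:
        cur.copy_expert(sql, stream)
```

Any transformation step would still have to happen either before the file is written or in SQL after loading, for example via a staging table.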

When I tried using FDW for a very large (> 10 GB) CSV I hit memory problems with Postgres (along with handling some of the bizarre mistakes in the MySQL export) and couldn't find any way around this. – Charlie Clark Jan 05 '22 at 12:05
@CharlieClark, PostgreSQL version? You can report the error, here or [at the PostgreSQL options](https://www.postgresql.org/docs/current/bug-reporting.html). Another solution is to split the file: for example, use Unix `head -n 1000 file.csv > file_test.csv` to check the first 1000 lines with FDW, since the problem can be a malformed CSV. PS: to fix a malformed CSV you can use [csvformat](https://csvkit.readthedocs.io/en/latest/scripts/csvformat.html) – Peter Krauss Jan 09 '22 at 20:06
I've had the errors with different versions of Postgres but the most recent was 13. I haven't pursued other options because pgloader covers most of them, but I was surprised at the memory problems. – Charlie Clark Jan 10 '22 at 12:02

score 12 · Answer 6 · edited May 23 '17 at 12:02

You can use COPY table FROM ... WITH BINARY, which is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.

Here is an example recipe in Python, using psycopg2 with binary input.

edited May 23 '17 at 12:02

Community

answered Nov 17 '11 at 09:33

Mike T

score 10 · Answer 7 · edited Jan 05 '22 at 12:02

It mostly depends on the (other) activity in the database. Operations like this effectively freeze the entire database for other sessions. Another consideration is the data model and the presence of constraints, triggers, etc.

My first approach is always: create a (temp) table with a structure similar to the target table (create table tmp AS select * from target where 1=0), and start by reading the file into the temp table. Then I check what can be checked: duplicates, keys that already exist in the target, etc.

Then I just do an insert into target select * from tmp, or similar.

If this fails, or takes too long, I abort it and consider other methods (temporarily dropping indexes/constraints, etc.).

score 5 · Answer 8 · edited Jun 20 '20 at 23:25

I just encountered this issue and would recommend csvsql (releases) for bulk imports to Postgres. To perform a bulk insert you'd simply createdb and then use csvsql, which connects to your database and creates individual tables for an entire folder of CSVs.

$ createdb test 
$ csvsql --db postgresql:///test --insert examples/*.csv

edited Jun 20 '20 at 23:25

Peter Krauss

answered Aug 13 '15 at 15:08

Sarah Frostenson

For csvsql, in order to also clean the source CSV of any possible formatting errors, it is best to follow [these instructions](http://sptl.eu/2015/01/03/import-csv-data-into-postgresql-the-comfortable-way/), more documentation [here](http://csvkit.readthedocs.org/en/0.9.1/scripts/csvsql.html) – sal Nov 11 '15 at 10:17

score 4 · Answer 9 · answered Aug 29 '14 at 10:48

I implemented a very fast PostgreSQL data loader using native libpq methods. Try my package https://www.nuget.org/packages/NpgsqlBulkCopy/

answered Aug 29 '14 at 10:48

Elyor

score 2 · Answer 10 · answered Jul 29 '21 at 17:23

Maybe I'm late already, but there is a Java library called pgbulkinsert by Bytefish. My team and I were able to bulk insert 1 million records in 15 seconds. Of course, that figure includes some other operations we performed: reading 1M+ records from a file stored on Minio, doing some processing on those records, filtering out duplicates, and finally inserting 1M records into the Postgres database. All of these steps completed within 15 seconds. I don't remember exactly how long the DB operation took, but I think it was less than 5 seconds. Find more details at https://www.bytefish.de/blog/pgbulkinsert_bulkprocessor.html

I was able to insert 1M records in 3.7 seconds. – Sajidur Rahman Jun 17 '22 at 12:25

score 0 · Answer 11 · answered Jan 05 '22 at 12:14

As others have noted, when importing data into Postgres, things will be slowed by the checks that Postgres is designed to do for you. Also, you often need to manipulate the data in one way or another so that it's suitable for use. Any of this that can be done outside of the Postgres process will mean that you can import using the COPY protocol.

For my use I regularly import data from the httparchive.org project using pgloader. As the source files are created by MySQL, you need to be able to handle some MySQL oddities, such as the use of \N for an empty value, along with encoding problems. The files are also so large that, at least on my machine, using FDW runs out of memory. pgloader makes it easy to create a pipeline that lets you select the fields you want, cast them to the relevant data types, and do any additional work before the data goes into your main database, so that index updates, etc. are minimal.

score -1 · Answer 12 · answered Dec 17 '22 at 11:18

The query below creates a table named test with a generate_series column holding 10000 rows. I usually create such a test table to test query performance (see the generate_series() documentation):

CREATE TABLE test AS SELECT generate_series(1, 10000);

postgres=# SELECT count(*) FROM test;
 count
-------
 10000
(1 row)

postgres=# SELECT * FROM test;
 generate_series
-----------------
               1
               2
               3
               4
               5
               6
-- More --

And run the query below to insert 10000 more rows if you already have the test table:

INSERT INTO test (generate_series) SELECT generate_series(1, 10000);
