
I have been practising bulk loading using Laurenz Albe's blog https://www.cybertec-postgresql.com/en/postgresql-bulk-loading-huge-amounts-of-data/. The problem started when I changed it a little bit. The table is the same:

CREATE TABLE t_sample
(
     a              varchar(50),
     b              int,
     c              varchar(50),
     d              int
);

Instead of

BEGIN;
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
…
COMMIT;

I use

BEGIN;
INSERT INTO t_sample VALUES
('abcd', 1, 'abcd', 1),
('abcd', 1, 'abcd', 1),
('abcd', 1, 'abcd', 1),
…
COMMIT;

For those who do not want to read the blog: I am trying to compare memory consumption between the COPY and INSERT methods.

Also, instead of 1 million records I used 2,097,152 records. I keep these commands in a file, multipleinsert.sql, and run it like psql -d load_test -f multipleinsert.sql.

Also, as stated in the blog, I used the COPY method as follows (of course with the same number of records, 2,097,152):

COPY t_sample FROM stdin;
abcd 1 abcd 1
abcd 1 abcd 1
abcd 1 abcd 1
...
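
If the rows lived in a separate tab-separated file on the client instead of inside the script, the psql meta-command \copy would stream them to the server-side COPY in the same way; a minimal sketch (the file name sample_data.tsv is an assumption):

-- psql meta-command; reads the client-side file and feeds it to COPY ... FROM STDIN
\copy t_sample FROM 'sample_data.tsv'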

The execution time for the INSERT is 14.543 s and for COPY it is 1.237 s. Yet that is not the point I want to compare. When I use COPY, the available memory on the server barely decreases, but during execution of the INSERT statement the available memory decreases by nearly 6.5 GB. The file that contains the INSERT statement is itself only about 49 MB, so PostgreSQL tries to cache it, but why does that scale to ~6.5 GB? Is there a method for (roughly) calculating the memory PostgreSQL will consume for such a load?

Note: "How does COPY work and why is it so much faster than INSERT?" explains well why COPY is faster than INSERT, but neither that post nor any other I found on the internet sheds light on the memory consumption.

asked by Umut TEKİN
  • If you could provide a Dockerfile with full repro steps (starting from the default postgresql image with a fixed version), I will start a small bounty on this question, because this behaviour seems very unexpected to me, and the existing answer does not provide an actual answer. I'd really like to learn the reasons behind it, and maybe file a bug against Postgres. – STerliakov Jun 12 '23 at 18:54

1 Answer


The multi-line INSERT is processed differently from COPY.

  • With the INSERT, a large statement is sent from the client, parsed and planned on the server, and then executed. The whole statement is kept in memory on the client and on the server, and the whole big thing has to be parsed and planned, so I am not surprised that it uses a lot of memory (see the sketch after this list).

    However, I cannot explain why a 50MB statement would temporarily need 6.5 GB RAM. That seems excessive.

  • The COPY statement itself is small, and the data are streamed from client to server and inserted into the table as they arrive. COPY can batch several rows for a bulk insert for performance, but the batch will be smaller than your two million rows, so it will use way less memory.
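
A rough way to separate the cost of the row data from the cost of the huge statement text is to produce the same number of rows from a small statement, for example with generate_series; a minimal sketch (it assumes identical rows are acceptable for the test):

BEGIN;
-- The statement sent to the server is only a few hundred bytes, so there is
-- no gigantic SQL text to keep in memory, parse and plan; only the rows
-- themselves are generated and inserted.
INSERT INTO t_sample (a, b, c, d)
SELECT 'abcd', 1, 'abcd', 1
FROM generate_series(1, 2097152);
COMMIT;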

To debug why so much memory is used, disable memory overcommit on the machine, then let the INSERT run until it is out of memory. PostgreSQL will write a memory context dump to the log file, which will tell you where the memory is allocated.
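
If the server runs PostgreSQL 14 or later, a memory context dump can also be requested on demand instead of waiting for an out-of-memory error; a minimal sketch (the PID 12345 is a placeholder, and the call needs appropriate privileges, superuser by default):

-- From a second session, find the backend running the big INSERT
SELECT pid, state, left(query, 40) AS query_snippet
FROM pg_stat_activity
WHERE query LIKE 'INSERT INTO t_sample%';

-- Ask that backend to write its memory context statistics to the server log
SELECT pg_log_backend_memory_contexts(12345);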

answered by Laurenz Albe
  • Could you elaborate a bit on the first point? The OP said the file is less than 50 MB, so from your explanation it seems like it should use roughly 150 MB (two copies of the query with quotes, parens, etc.) plus some planning and parsing overhead. Does that "overhead" outweigh the actual data size? Why? Using 6.5 GB of memory to process a 50 MB data sample is a bit more than one might expect... – STerliakov Apr 21 '23 at 09:44
  • @SUTerliakov I did not see that the discrepancy was that high. No, I have no explanation for that. – Laurenz Albe Apr 21 '23 at 10:16
  • It is not the only case. In this case the table has 4 columns and the multi-row INSERT file is ~49 MB. Yet I have another table with 30 columns, and I prepared the same test for it with exactly 1 million records. That file was around 670 MB, and when I execute it I have to terminate the session, because it will eat all the memory of the server unless I kill it. By the way, the server has 64 GB of memory and 32 GB of swap. – Umut TEKİN Apr 21 '23 at 11:15
  • So, as other database systems do, PostgreSQL needs to parse it, but there should be an explanation for using this much memory, or some kind of calculation. For bulk loading, yes, this is not the best approach. We need to divide the data into smaller pieces like 5k or 10k rows, but that is not the point of my question :). – Umut TEKİN Apr 21 '23 at 11:22
  • Isn't there any other option than waiting for PostgreSQL to use up all the memory :S? – Umut TEKİN Apr 24 '23 at 12:54
  • Debugging memory problems is never trivial. – Laurenz Albe Apr 24 '23 at 16:52