
I am working with an external data source (a regulatory database) in the form of a large CSV file (several GiB, ~200M rows), which I import as-is into a separate PostgreSQL table to run queries against it:

CREATE TABLE foo (id serial primary key, bar integer, baz integer, UNIQUE (bar, baz));
COPY foo (bar, baz) FROM 'input.csv' WITH (FORMAT CSV, HEADER TRUE);

This file is updated every week or so, with a very small number of rows actually being changed (added or deleted).

Is there a way to synchronize the table with the updated CSV that will be more efficient than redoing the above and then atomically renaming the table?
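
For reference, the full reload plus atomic rename mentioned above looks roughly like this (foo_new is just an illustrative name for the staging table):

-- The staging table name foo_new is illustrative; its sequence, index and
-- constraint names will keep the _new suffix after the rename.
BEGIN;
CREATE TABLE foo_new (id serial primary key, bar integer, baz integer, UNIQUE (bar, baz));
COPY foo_new (bar, baz) FROM 'input.csv' WITH (FORMAT CSV, HEADER TRUE);
DROP TABLE foo;
ALTER TABLE foo_new RENAME TO foo;
COMMIT;

This rewrites all ~200M rows and rebuilds both indexes every week, even though only a handful of rows actually change.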

intelfx
  • Is it sorted data? What if you diff this week’s CSV with last week’s, and then generate update statements? – richyen Nov 01 '19 at 04:10
  • Another option would be to use file_fdw and create a foreign table, then do whatever you wish to import, update, etc. (see the sketch after these comments): https://www.postgresql.org/docs/current/file-fdw.html – richyen Nov 01 '19 at 04:19
  • please refer to https://stackoverflow.com/a/8910810/1770135 – pavan Nov 01 '19 at 04:54
  • What type of efficiency are you worried about? I'd probably just make them unlogged tables, so you don't have to worry about blowing out your WAL files, then do the atomic rename. My time is more valuable than a CPU's time, and I'd rather not debug incremental synchronization scripts. I say "probably" because it depends on things you haven't told us, like indexes and constraints and whether you can deal with the table being missing for a short while after a crash; not because I am uncertain whether it would work. – jjanes Nov 01 '19 at 15:50
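
A rough sketch of the file_fdw suggestion from the comments, combined with a set-based sync in the spirit of the diff idea; the server name csv_src, the foreign table foo_csv and the file path are illustrative, not from the question:

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_src FOREIGN DATA WRAPPER file_fdw;

-- Foreign table that reads the freshly delivered CSV in place
-- (setting the filename option requires superuser or, on newer
-- versions, the pg_read_server_files role).
CREATE FOREIGN TABLE foo_csv (bar integer, baz integer)
    SERVER csv_src
    OPTIONS (filename '/path/to/input.csv', format 'csv', header 'true');

-- Apply only the differences: add rows that are new in the CSV,
-- remove rows that no longer appear in it.
BEGIN;

INSERT INTO foo (bar, baz)
SELECT bar, baz FROM foo_csv
EXCEPT
SELECT bar, baz FROM foo;

DELETE FROM foo
WHERE (bar, baz) IN (
    SELECT bar, baz FROM foo
    EXCEPT
    SELECT bar, baz FROM foo_csv
);

COMMIT;

This still scans the whole CSV on every sync, but only the changed rows generate writes, WAL and index maintenance; NULLs in bar or baz would need extra care. The unlogged-table variant from the last comment is the full reload from the question with CREATE UNLOGGED TABLE for the staging table, swapped in via the same atomic rename.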

0 Answers