The fast lane: create a temporary table matching the structure of the CSV file (possibly using an existing table as a template for convenience) and use COPY:
Bulk load
CREATE TEMP TABLE tmp(email text);
COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp; -- do that for bigger tables!
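To use an existing table as the template instead, a sketch, assuming the CSV columns match the target table tbl used below:
CREATE TEMP TABLE tmp (LIKE tbl INCLUDING DEFAULTS);
-- or:
CREATE TEMP TABLE tmp AS TABLE tbl WITH NO DATA;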
I am assuming the emails in the CSV are unique, since you did not specify. If they are not, make them unique:
CREATE TEMP TABLE tmp0 AS
SELECT DISTINCT email
FROM tmp
ORDER BY email; -- ORDER BY is cheap in combination with DISTINCT
                -- and may or may not improve performance additionally.
DROP TABLE tmp;
ALTER TABLE tmp0 RENAME TO tmp;
Index
For your particular case, a unique index on email is in order.
It is much more efficient to create the index after loading and sanitizing the data. This way you also prevent COPY from bailing out with a unique violation in case there are dupes:
CREATE UNIQUE INDEX tmp_email_idx ON tmp (email);
On second thought, if all you do is update the big table, you don't need an index on the temporary table at all. It will be read sequentially.
You comment: "Yes DB table is indexed using primary key."
The only relevant index in this case:
CREATE INDEX tbl_email_idx ON tbl (email);
Make that CREATE UNIQUE INDEX ... if possible.
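If emails in the big table are in fact unique, that would be:
CREATE UNIQUE INDEX tbl_email_idx ON tbl (email);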
Update
To update your table as detailed in your later comment:
UPDATE tbl t
SET ...
FROM tmp
WHERE t.email = tmp.email;
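For illustration only, with a hypothetical column verified (not from your question) standing in for the actual assignments:
UPDATE tbl t
SET    verified = true   -- hypothetical column, replace with your assignments
FROM   tmp
WHERE  t.email = tmp.email;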
All of this can easily be wrapped into a plpgsql or sql function.
Note that COPY requires dynamic SQL with EXECUTE in a plpgsql function if you want to parameterize the file name.
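A minimal sketch of such a plpgsql function, reusing the names from above plus the hypothetical column verified; the file path is a parameter, which is why COPY runs via EXECUTE with format():
CREATE OR REPLACE FUNCTION f_update_from_csv(_path text)
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   DROP TABLE IF EXISTS tmp;           -- allow repeated calls in one session
   CREATE TEMP TABLE tmp(email text);

   -- file name is parameterized, hence dynamic SQL
   EXECUTE format('COPY tmp FROM %L', _path);

   ANALYZE tmp;

   UPDATE tbl t
   SET    verified = true              -- hypothetical assignment, replace with yours
   FROM   tmp
   WHERE  t.email = tmp.email;

   DROP TABLE tmp;
END
$func$;
Call it like:
SELECT f_update_from_csv('path/to/file.csv');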
Temporary tables are dropped at the end of the session automatically by default.
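To drop the table at the end of the transaction instead, create it with ON COMMIT DROP:
CREATE TEMP TABLE tmp(email text) ON COMMIT DROP;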
Related answer:
How to bulk insert only new rows in PostgreSQL