
I have to read a CSV file every 20 seconds. Each file contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of duplicate items. The field that is checked for uniqueness is also indexed.

So I read the file in chunks and use an `IN` clause to find the items that are already in the database.
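
For illustration, the per-chunk check currently looks roughly like this (table and column names are just placeholders):

SELECT item_key
FROM   items
WHERE  item_key IN ('key-1', 'key-2', 'key-3');  -- one value per line in the current chunk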

Is there a better way of doing it?

Aman
  • Add proper unique indexes and just perform the inserts, ignoring duplicate errors. – user229044 Oct 06 '12 at 03:26
  • will that have better performance in large sets of data? – Aman Oct 06 '12 at 03:29
  • @meagar: You can't just "ignore duplicate errors" with set-based operations. The transaction will be rolled back and nothing gets done. Would only make sense if you `INSERT` rows one by one - which would be a terrible idea, as far as performance is concerned. – Erwin Brandstetter Oct 06 '12 at 03:47
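
(Note for later readers: on PostgreSQL 9.5 and newer, which postdates this thread, a set-based insert can skip duplicate keys without aborting the whole transaction, provided there is a unique index or constraint on the key column. A minimal sketch, reusing the `tmp` / `tbl` / `tbl_id` names from the answers below:)

INSERT INTO tbl
SELECT *
FROM   tmp
ON     CONFLICT (tbl_id) DO NOTHING;  -- requires a unique index or constraint on tbl_id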

2 Answers


This should perform well:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0;  -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- else dropped at end of session automatically

Closely related to this answer.
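
Note that COPY ... FROM '/file' reads the file on the database server, so the server process must be able to see the file and you need suitable privileges. If the CSV sits on the client machine instead, psql's \copy meta-command does the same thing client-side (same names as above):

\copy tmp FROM '/absolute/path/to/file' WITH (FORMAT csv)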

Erwin Brandstetter
  • @TheRaaaZ: Join on enough columns to *make* it unique. *All columns* if you have to. The [`USING` or the `NATURAL` clause](http://www.postgresql.org/docs/current/interactive/sql-select.html#SQL-FROM) would come in handy for that (see the sketch after these comments). Normally, every table should have a primary key, though. – Erwin Brandstetter Mar 30 '14 at 14:09
  • @Erwin I have an array of columns; can you tell me what the query would be if the array has 5 columns, col1-col5? – Raza Ahmed Mar 30 '14 at 14:14
  • @TheRaaaZ: [Start a new question](http://stackoverflow.com/questions/ask), giving a complete picture. You can always link to this one for reference. – Erwin Brandstetter Mar 30 '14 at 14:54
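
A sketch of what that multi-column anti-join could look like, using the hypothetical columns col1 through col5 from the comment above:

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (col1, col2, col3, col4, col5)  -- all five columns together identify a row
WHERE  tbl.col1 IS NULL;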

First, just for completeness, I changed Erwin's code to use `EXCEPT`:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0;  -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
EXCEPT
SELECT *
FROM   tbl;

DROP TABLE tmp;

Then I decided to test it myself, on 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the source table 30 thousand, of which 15 thousand already exist in the target table.

create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
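
One detail not in the original test: autovacuum does not analyze temporary tables, so collecting statistics on tmp by hand gives the planner accurate row counts for the plans below:

ANALYZE tmp;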

I asked for the EXPLAIN of the SELECT part only. The EXCEPT version:

explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

The outer join version:

explain
select *
from 
    tmp
    left join
    tbl using (id)
where tbl.id is null
;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)
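
The plans show why the anti-join comes out cheaper here: it probes tbl's primary-key index once per row of tmp, while EXCEPT has to sequentially scan all 10 million rows of tbl. Note also that the two forms are not strictly equivalent: EXCEPT compares entire rows and removes duplicates within tmp itself, which the anti-join does not. A small illustration of that difference:

-- EXCEPT collapses the two identical staged rows into one result row;
-- the anti-join form would insert both of them.
SELECT * FROM (VALUES (1), (1)) AS s(id)
EXCEPT
SELECT * FROM (VALUES (2)) AS t(id);
-- returns a single row: id = 1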
Clodoaldo Neto