The fast lane: create a temporary table matching the structure of the CSV file (possibly using an existing table as a template for convenience) and use COPY:
Bulk load
CREATE TEMP TABLE tmp(email text);
COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp; -- do that for bigger tables!
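To use an existing table as the template instead, a sketch, assuming the CSV columns match the target table tbl used below:
CREATE TEMP TABLE tmp (LIKE tbl INCLUDING DEFAULTS);
-- or:
CREATE TEMP TABLE tmp AS TABLE tbl WITH NO DATA;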
I am assuming the emails in the CSV are unique, since you did not specify. If they are not, make them unique:
CREATE TEMP TABLE tmp0 AS
SELECT DISTINCT email
FROM tmp
ORDER BY email; -- ORDER BY is cheap in combination with DISTINCT
                -- and may or may not improve performance additionally.
DROP TABLE tmp;
ALTER TABLE tmp0 RENAME TO tmp;
Index
For your particular case, a unique index on email is in order.
It is much more efficient to create the index after loading and sanitizing the data. This way you also prevent COPY from bailing out with a unique violation in case there are dupes:
CREATE UNIQUE INDEX tmp_email_idx ON tmp (email);
On second thought, if all you do is update the big table, you don't need an index on the temporary table at all. It will be read sequentially.
You comment: "Yes DB table is indexed using primary key."
The only relevant index in this case:
CREATE INDEX tbl_email_idx ON tbl (email);
Make that CREATE UNIQUE INDEX ... if possible.
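If emails in the big table are in fact unique, that would be:
CREATE UNIQUE INDEX tbl_email_idx ON tbl (email);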
Update
To update your table as detailed in your later comment:
UPDATE tbl t
SET ...
FROM tmp
WHERE t.email = tmp.email;
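For illustration only, with a hypothetical column verified (not from your question) standing in for the actual assignments:
UPDATE tbl t
SET    verified = true   -- hypothetical column, replace with your assignments
FROM   tmp
WHERE  t.email = tmp.email;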
All of this can easily be wrapped into a plpgsql or sql function.
Note that COPY requires dynamic SQL with EXECUTE in a plpgsql function if you want to parameterize the file name.
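A minimal sketch of such a plpgsql function, reusing the names from above plus the hypothetical column verified; the file path is a parameter, which is why COPY runs via EXECUTE with format():
CREATE OR REPLACE FUNCTION f_update_from_csv(_path text)
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   DROP TABLE IF EXISTS tmp;           -- allow repeated calls in one session
   CREATE TEMP TABLE tmp(email text);

   -- file name is parameterized, hence dynamic SQL
   EXECUTE format('COPY tmp FROM %L', _path);

   ANALYZE tmp;

   UPDATE tbl t
   SET    verified = true              -- hypothetical assignment, replace with yours
   FROM   tmp
   WHERE  t.email = tmp.email;

   DROP TABLE tmp;
END
$func$;
Call it like:
SELECT f_update_from_csv('path/to/file.csv');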
Temporary tables are dropped at the end of the session automatically by default.
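To drop the table at the end of the transaction instead, create it with ON COMMIT DROP:
CREATE TEMP TABLE tmp(email text) ON COMMIT DROP;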
Related answer:
How to bulk insert only new rows in PostgreSQL