
I have a Postgres database with multiple schemas, each schema with multiple tables, and each table with a few million rows.

From time to time, I have to download a file containing a few million lines and compare each line with every row in all tables; if the row is found, I must update a column in it.

The first thing I tried was reading the file line by line, running a SELECT query on each table in each schema, and running the UPDATE if the row was found. It works well on my testing platform, but with the actual database it would run forever, since it only executes around 1000 queries per second (I checked with the query: `SELECT xact_commit+xact_rollback FROM pg_stat_database WHERE datname = 'mailtng_lists';`).
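Roughly, the pattern was the following (a simplified sketch with made-up schema, table and column names such as `email` and `status`, and PDO for the connection; not the actual code):

    // Sketch of the first, line-by-line approach; all names are hypothetical.
    $pdo = new PDO('pgsql:host=localhost;dbname=mailtng_lists', 'user', 'pass');

    // hypothetical list of schemas and their tables
    $tables = [
        'schema_a' => ['contacts', 'subscribers'],
        'schema_b' => ['contacts'],
    ];

    $handle = fopen('bigfile.txt', 'r');
    while (($line = fgets($handle)) !== false) {
        $value = trim($line);
        foreach ($tables as $schema => $schemaTables) {
            foreach ($schemaTables as $table) {
                // one SELECT per line and per table...
                $sel = $pdo->prepare("SELECT 1 FROM \"$schema\".\"$table\" WHERE email = :v");
                $sel->execute([':v' => $value]);
                if ($sel->fetchColumn()) {
                    // ...plus one UPDATE whenever the row is found
                    $upd = $pdo->prepare("UPDATE \"$schema\".\"$table\" SET status = 'found' WHERE email = :v");
                    $upd->execute([':v' => $value]);
                }
            }
        }
    }
    fclose($handle);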

The second thing I tried was separating the main script from the script that connects to the database: I split the huge file into chunks of 100K lines each, then called the script that does the connection X times with the following command:

    foreach ($chunks as $chunk) // $chunks is the result of the split command (no problem here)
    {
        // exec() blocks until the command finishes, so the chunks still run one after another
        exec("php updater.php " . escapeshellarg($chunk));
    }

But it brought no improvement at all; the number of queries per second was still very low. So the last thing I tried was doing the same but with shell_exec, so the script wouldn't have to wait for the output, but the server crashed, since I had 173 chunks and that resulted in 173 PHP instances running at once.
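In other words, exec() runs the chunks strictly one after another, while backgrounding them starts all 173 at once. A middle ground would be to cap the number of concurrent workers; a rough, untested sketch (the limit of 4 workers is an arbitrary guess, $chunks is the same array as above):

    // Run updater.php on the chunks with at most $maxWorkers processes at a time.
    $maxWorkers = 4; // arbitrary, to be tuned
    $running = [];

    foreach ($chunks as $chunk) {
        // wait until a worker slot is free
        while (count($running) >= $maxWorkers) {
            foreach ($running as $i => $proc) {
                $status = proc_get_status($proc);
                if (!$status['running']) {
                    proc_close($proc);
                    unset($running[$i]);
                }
            }
            usleep(200000); // 0.2 s
        }
        $running[] = proc_open('php updater.php ' . escapeshellarg($chunk), [], $pipes);
    }

    // wait for the remaining workers to finish
    foreach ($running as $proc) {
        while (proc_get_status($proc)['running']) {
            usleep(200000);
        }
        proc_close($proc);
    }

Even so, this only spreads the same per-row queries over a few connections; it doesn't reduce the total number of queries.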

Any idea on how to handle this situation?

  • I would recommend using [procedures](https://stackoverflow.com/q/12831667) in your db completely. It's definitely faster than doing the same in PHP, and there is a good chance that you won't need to split up the lines anymore to check them. – Blauharley May 27 '18 at 16:37
  • Excellent idea, I did not think of that, I will give it a try this evening and return to you. Thank you – Soufiane Touil May 27 '18 at 16:41
  • Such an operation will *always* be painful; you should restrict the number of tables you have to update as much as possible. I'd loop through all the tables in question and for each perform an `UPDATE` that joins the table of new values (see the sketch after these comments). – Laurenz Albe May 28 '18 at 04:27
  • @Blauharley The performance increased very slightly with the use of procedures; I had to change the complete logic of the application to reduce the update time. Thanks for your help – Soufiane Touil May 28 '18 at 13:24
  • @LaurenzAlbe Unfortunately, restricting the number of tables is not an option; as for the loop, that's what I'm doing and it's very slow. – Soufiane Touil May 28 '18 at 13:25
  • @SoufianeTouil: You are welcome; unfortunately it's always an issue with us as well, as the db-size of a web-app increases. Aside from the primary key, you might take advantage of self-defined indexes as well to speed up the actual search. Or you could use additional tables that hold the "most relevant" information from several tables... I wish you good luck! – Blauharley May 28 '18 at 13:35
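To make the comments' suggestion concrete, here is a minimal sketch of the set-based variant, assuming made-up names (an import_emails staging table, an `email` lookup column, a `status` column to update) and the pdo_pgsql driver's pgsqlCopyFromFile helper; a plain `COPY`/`\copy` would work just as well:

    // Load the file once, then run one UPDATE with a join per table,
    // instead of one SELECT + UPDATE per line. All names are hypothetical.
    $pdo = new PDO('pgsql:host=localhost;dbname=mailtng_lists', 'user', 'pass');

    // same hypothetical schema => tables list as in the first sketch
    $tables = [
        'schema_a' => ['contacts', 'subscribers'],
        'schema_b' => ['contacts'],
    ];

    // 1. Load the whole file once into a staging table.
    $pdo->exec("CREATE TEMP TABLE import_emails (email text)");
    $pdo->pgsqlCopyFromFile('import_emails', '/path/to/bigfile.txt'); // COPY ... FROM STDIN under the hood
    $pdo->exec("CREATE INDEX ON import_emails (email)");
    $pdo->exec("ANALYZE import_emails");

    // 2. One set-based UPDATE per table instead of one query per line.
    foreach ($tables as $schema => $schemaTables) {
        foreach ($schemaTables as $table) {
            $pdo->exec("
                UPDATE \"$schema\".\"$table\" AS t
                SET    status = 'found'
                FROM   import_emails i
                WHERE  t.email = i.email
            ");
        }
    }

The loop over the catalog could also be moved into a PL/pgSQL function if everything should stay inside the database, as the first comment suggests.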

0 Answers