
I have a table with around 25-30 million rows and 15 columns. Most of the columns are static or get real-time updates, but for a few of them (3 to 4) I get the data from HDFS once a day, which means those columns have to be updated once a day.

I tried creating a new table from the HDFS output and updating the main table with a join, but that seems to take forever. I also tried batch updates, but those take far too long as well.
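
The join-based update I tried looks roughly like this (table and column names here are placeholders, not my real schema):

```sql
-- staging table created from the HDFS output (placeholder names)
CREATE TABLE daily_feed (
    id    bigint PRIMARY KEY,
    col_a text,
    col_b text,
    col_c numeric
);

-- join-based update of the daily columns on the main table
UPDATE main_table m
SET    col_a = f.col_a,
       col_b = f.col_b,
       col_c = f.col_c
FROM   daily_feed f
WHERE  m.id = f.id;
```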

What is the best way to do this?

I am using Postgres as my database.

Peter
  • If you want to update data, first decide on the database you are using and tag the question appropriately. I removed the extraneous database tags. – Gordon Linoff May 12 '16 at 03:29
  • @GordonLinoff edited the question. I have tagged other relational DBs as well, so that people with similar experience who use some other RDBMS and not Postgres can also give input here. – Peter May 12 '16 at 04:30
  • @Peter Problems and solutions of this nature tend to be RDBMS-specific, but may also be of use to other DBs. If you want a generic solution for all, just use the sql tag. It would also be impossible to find one "best way" for all. – Alex May 12 '16 at 04:52
  • @Alex I feel this is a very generic problem for all RDBMSes. The reason I tagged other DBs is that some expert might be following sql-server and not sql, so I might miss their opinion. I hope that is clear. – Peter May 12 '16 at 05:12
  • How many rows need to be updated, how long does it take, and on what hardware? On a side note, you may have better luck at http://dba.stackexchange.com/. – Alex May 12 '16 at 05:41
  • 25 million rows, 30 GB RAM, 4 cores. I have been waiting for the last 15 hours for it to complete. – Peter May 12 '16 at 07:24
  • 1
    OK, this seems like way too long. Please update your question with table scema (table create script) and indexes that you have on this table, details of the update method you are currently using. What storge configuration you are using (numbers and types of hard drives) to store your database. – Alex May 12 '16 at 08:16
  • If most of the columns are static and only a few are updated frequently, you *could* try splitting them into two tables (see the sketch below). This will minimise the IO traffic needed for future updates. ALSO: updating a few million records out of a few tens of millions should not take more than a few minutes, given a correct table structure. Please add your table definition(s) to the question. – joop May 12 '16 at 08:29
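
A minimal sketch of the split joop describes, with invented placeholder names: the daily-refreshed columns move to a narrow side table keyed by the same id, so the nightly load only rewrites that narrow table.

```sql
-- narrow side table holding only the daily-refreshed columns (placeholder names)
CREATE TABLE main_table_daily (
    id    bigint PRIMARY KEY REFERENCES main_table (id),
    col_a text,
    col_b text,
    col_c numeric
);

-- the nightly job then only rewrites this narrow table
TRUNCATE main_table_daily;
INSERT INTO main_table_daily (id, col_a, col_b, col_c)
SELECT id, col_a, col_b, col_c
FROM   daily_feed;
```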

2 Answers


The best way to improve the performance of an update/insert query is to use a bulk insert/update.

Please have a look at the posts below, which will help you:

MySQL bulk INSERT or UPDATE

Does splitting up an update query improve performance
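
Since the question is about Postgres, a minimal sketch of the bulk approach might look like this (placeholder names; `ON CONFLICT` needs Postgres 9.5+ and assumes `id` has a unique constraint):

```sql
-- bulk-load the daily file into a staging table
-- (COPY reads a file on the server; psql's \copy is the client-side equivalent)
COPY daily_feed (id, col_a, col_b, col_c) FROM '/path/to/daily_feed.csv' WITH (FORMAT csv);

-- one set-based statement: insert new rows, update the daily columns of existing ones
INSERT INTO main_table (id, col_a, col_b, col_c)
SELECT id, col_a, col_b, col_c
FROM   daily_feed
ON CONFLICT (id) DO UPDATE
SET    col_a = EXCLUDED.col_a,
       col_b = EXCLUDED.col_b,
       col_c = EXCLUDED.col_c;
```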

shankarsh15
  • Before anyone downvotes this answer, please note that the original question had every imaginable SQL tag attached to it, so this was a fair answer at the time (though a quote from the linked page would have made it better) – e4c5 May 12 '16 at 05:00
  • I have mentioned in the question that even the batch updates are taking very, very long. – Peter May 12 '16 at 05:10

I would try something like creating a new table with the CREATE TABLE AS ... command, using a join between the old table and the table you are building from the batch data, instead of updating the old table.

http://www.postgresql.org/docs/9.4/static/sql-createtableas.html

If you place that inside a transaction (you can wrap it in a function), afterwards you can simply rename the tables to switch them, or drop the old one and rename the new one.

I have seen processes where this strategy works like clockwork for batch jobs (creating a new table is faster than updating the old one).

You will need to test it to see whether it suits your tables, and also look at the indexes you have.

Best of luck!

Note: you will also have to add the indexes and constraints to the new table, as CREATE TABLE AS does not copy them.
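
A rough sketch of the whole flow with invented names (your real column list, indexes and constraints will differ):

```sql
BEGIN;

-- build the replacement table, pulling the refreshed columns from the batch table
CREATE TABLE main_table_new AS
SELECT m.id,
       m.static_col_1,
       m.static_col_2,
       COALESCE(f.col_a, m.col_a) AS col_a,
       COALESCE(f.col_b, m.col_b) AS col_b,
       COALESCE(f.col_c, m.col_c) AS col_c
FROM   main_table m
LEFT   JOIN daily_feed f ON f.id = m.id;

-- CREATE TABLE AS copies neither indexes nor constraints, so re-create them here
ALTER TABLE main_table_new ADD PRIMARY KEY (id);
-- CREATE INDEX main_table_new_col_a_idx ON main_table_new (col_a);

-- swap the tables; drop the old one once the result is verified
ALTER TABLE main_table RENAME TO main_table_old;
ALTER TABLE main_table_new RENAME TO main_table;

COMMIT;
```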

lsilva