
I have a table with around 25-30 million rows and 15 columns. Most of the columns are static or get real-time updates, but for a few of them (3 to 4) I get the data from HDFS once a day, which means those columns have to be updated once a day.

I tried creating a new table from the HDFS output and updating the main table with a join, but that seems to take forever. I also tried batch updates, but those take far too long as well.
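
The join-based update I tried looks roughly like this (table and column names here are placeholders, not my real schema):

```sql
-- staging table created from the HDFS output (placeholder names)
CREATE TABLE daily_feed (
    id    bigint PRIMARY KEY,
    col_a text,
    col_b text,
    col_c numeric
);

-- join-based update of the daily columns on the main table
UPDATE main_table m
SET    col_a = f.col_a,
       col_b = f.col_b,
       col_c = f.col_c
FROM   daily_feed f
WHERE  m.id = f.id;
```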

What is the best way to do this?

I am using Postgres as my database.

Peter
  • If you want to update data, first decide on the database you are using and tag the question appropriately. I removed the extraneous database tags. – Gordon Linoff May 12 '16 at 03:29
  • @GordonLinoff edited the question. I have tagged other relational DBs as well, so that people with similar experience who use some other RDBMS and not Postgres can also give input here. – Peter May 12 '16 at 04:30
  • @Peter Problems and solutions of this nature tend to be RDBMS-specific, but may also be of use to other DBs. If you want a generic solution for all, just use the sql tag. It would also be impossible to find one "best way" for all. – Alex May 12 '16 at 04:52
  • @Alex I feel this is a very generic problem for all RDBMSes. The reason I tagged other DBs is that some expert might be following sql-server and not sql, so I might miss their opinion. I hope that is clear. – Peter May 12 '16 at 05:12
  • How many rows need to be updated, how long does it take, and on what hardware? On a side note, you may have better luck at http://dba.stackexchange.com/. – Alex May 12 '16 at 05:41
  • 25 million rows, 30 GB RAM, 4 cores. I have been waiting for the last 15 hours for it to complete. – Peter May 12 '16 at 07:24
  • 1
    OK, this seems like way too long. Please update your question with table scema (table create script) and indexes that you have on this table, details of the update method you are currently using. What storge configuration you are using (numbers and types of hard drives) to store your database. – Alex May 12 '16 at 08:16
  • If most of the columns are static and only a few are updated frequently, you *could* try splitting them into two tables (see the sketch below). This will minimise the IO traffic needed for future updates. ALSO: updating a few million records out of a few tens of millions should not take more than a few minutes, given a correct table structure. Please add your table definition(s) to the question. – joop May 12 '16 at 08:29
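
A minimal sketch of the split joop describes, with invented placeholder names: the daily-refreshed columns move to a narrow side table keyed by the same id, so the nightly load only rewrites that narrow table.

```sql
-- narrow side table holding only the daily-refreshed columns (placeholder names)
CREATE TABLE main_table_daily (
    id    bigint PRIMARY KEY REFERENCES main_table (id),
    col_a text,
    col_b text,
    col_c numeric
);

-- the nightly job then only rewrites this narrow table
TRUNCATE main_table_daily;
INSERT INTO main_table_daily (id, col_a, col_b, col_c)
SELECT id, col_a, col_b, col_c
FROM   daily_feed;
```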

2 Answers


The best way to improve the performance of an update/insert query is to use a bulk insert/update.

Please have a look at the posts below, which will help you:

MySQL bulk INSERT or UPDATE

Does splitting up an update query improve performance
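
Since the question is about Postgres, a minimal sketch of the bulk approach might look like this (placeholder names; `ON CONFLICT` needs Postgres 9.5+ and assumes `id` has a unique constraint):

```sql
-- bulk-load the daily file into a staging table
-- (COPY reads a file on the server; psql's \copy is the client-side equivalent)
COPY daily_feed (id, col_a, col_b, col_c) FROM '/path/to/daily_feed.csv' WITH (FORMAT csv);

-- one set-based statement: insert new rows, update the daily columns of existing ones
INSERT INTO main_table (id, col_a, col_b, col_c)
SELECT id, col_a, col_b, col_c
FROM   daily_feed
ON CONFLICT (id) DO UPDATE
SET    col_a = EXCLUDED.col_a,
       col_b = EXCLUDED.col_b,
       col_c = EXCLUDED.col_c;
```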

shankarsh15
  • Before anyone downvotes this answer, please note that the original question had every imaginable SQL tag attached to it, so this was a fair answer at the time (though a quote from the linked page would have made it better) – e4c5 May 12 '16 at 05:00
  • I have mentioned in the question that even the batch updates are taking very, very long. – Peter May 12 '16 at 05:10

I would try something like creating a new table with the CREATE TABLE AS ... command, using a join between the old table and the table you are building from the batch data, instead of updating the old table.

http://www.postgresql.org/docs/9.4/static/sql-createtableas.html

If you place that inside a transaction (you can wrap it in a function), afterwards you can simply rename the tables to switch them, or drop the old one and rename the new one.

I have seen processes where this strategy works like clockwork for batch jobs (creating a new table is faster than updating the old one).

You will need to test it to see whether it suits your tables, and also look at the indexes you have.

Best of luck!

Note: you will also have to add the indexes and constraints to the new table, as CREATE TABLE AS does not copy them.
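
A rough sketch of the whole flow with invented names (your real column list, indexes and constraints will differ):

```sql
BEGIN;

-- build the replacement table, pulling the refreshed columns from the batch table
CREATE TABLE main_table_new AS
SELECT m.id,
       m.static_col_1,
       m.static_col_2,
       COALESCE(f.col_a, m.col_a) AS col_a,
       COALESCE(f.col_b, m.col_b) AS col_b,
       COALESCE(f.col_c, m.col_c) AS col_c
FROM   main_table m
LEFT   JOIN daily_feed f ON f.id = m.id;

-- CREATE TABLE AS copies neither indexes nor constraints, so re-create them here
ALTER TABLE main_table_new ADD PRIMARY KEY (id);
-- CREATE INDEX main_table_new_col_a_idx ON main_table_new (col_a);

-- swap the tables; drop the old one once the result is verified
ALTER TABLE main_table RENAME TO main_table_old;
ALTER TABLE main_table_new RENAME TO main_table;

COMMIT;
```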

lsilva