
I have a table with 32 million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.

For example, if the initial table is:

id     initial_color
--     -------------
1      blue
2      red
3      yellow

I am modifying the table so that the result is:

id     initial_color     modified_color
--     -------------     --------------
1      blue              blue_green
2      red               red_orange
3      yellow            yellow_brown

I have code that will read the initial_color column and update the value.

Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:

  1. Copy the column and update the rows in the new column
  2. Create an empty column and insert new values

I could do either option one column at a time or with all five at once. The column types are either character varying or character.
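
A rough sketch of the two options in SQL (the table name tbl and the function my_modify() are placeholders, not actual code from my project):

-- Option 1: copy the column, then update the copies in place
ALTER TABLE tbl ADD COLUMN modified_color varchar;
UPDATE tbl SET modified_color = initial_color;             -- copy
UPDATE tbl SET modified_color = my_modify(modified_color); -- update

-- Option 2: add an empty column and insert the new values directly
ALTER TABLE tbl ADD COLUMN modified_color varchar;
UPDATE tbl SET modified_color = my_modify(initial_color);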

Erwin Brandstetter
Jesuisme
  • _Copy the column_: what SQL instruction would that be? – Daniel Vérité May 12 '15 at 20:06
  • @DanielVérité UPDATE <table> SET <new_column> = <old_column>; – Jesuisme May 12 '15 at 20:12
  • UPDATE doesn't create the new column, right? We need the "Copy the column" statement to copy the DDL. With the second option we need to explicitly specify all the fields, so you need more code to compute that. – yucer Apr 07 '16 at 09:00
  • I would think creating an empty column, selecting the initial column, mutating those values in memory, and inserting would be the most efficient. That's three commands that actually hit the database; you do the rest in memory, which is much faster. – steel Apr 07 '16 at 15:20
  • @yucer You're right; the UPDATE does not create the column. – Jesuisme Apr 07 '16 at 17:23
  • This might help: http://blog.codacy.com/2015/05/14/how-to-update-large-tables-in-postgresql/ – nmat Apr 09 '16 at 08:49

3 Answers


The columns types are either character varying or character.

Don't use character; that's a misunderstanding. varchar is OK, but I would suggest just text for arbitrary character data.

Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this?

If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):

BEGIN;
LOCK TABLE tbl_org IN SHARE MODE;  -- to prevent concurrent writes

CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);

ALTER TABLE tbl_new ADD COLUMN modified_color text
            , ADD COLUMN modified_something text;
            -- , etc
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
    ,  myfunction(initial_color) AS modified_color  -- etc
FROM   tbl_org;
-- ORDER  BY tbl_id;  -- optionally order rows while being at it.

-- Add constraints and indexes like in the original table here

DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;

If you have depending objects, you need to do more.

Either way, be sure to add all five columns at once. If you update each in a separate statement, you write a new row version every time due to the MVCC model of Postgres.
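
To illustrate the difference (the second column name and the function are hypothetical, standing in for your five columns and your conversion logic):

-- One statement, one new row version per row:
UPDATE tbl_org
SET    modified_color = myfunction(initial_color)
     , modified_size  = myfunction(initial_size);  -- etc. for all five

-- Five separate UPDATE statements would rewrite the whole table five times.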


While creating a new table, you might also order its columns in an optimized fashion (column order can affect the physical row size through alignment padding).

Erwin Brandstetter

Maybe I'm misreading the question, but as far as I know you have two possibilities for creating a table with the extra columns:

  1. CREATE TABLE
    This would create a new table, which could be filled using

    • CREATE TABLE ... AS SELECT ... at creation time, or
    • a separate INSERT ... SELECT ... later on.

    Neither variant is what you seem to want, as you stated a solution without listing all the fields. Also, this would require all data (plus the new fields) to be copied.
  2. ALTER TABLE ... ADD ...
    This creates the new columns. As I'm not aware of any way to reference existing column values in the same statement, you will need an additional UPDATE ... SET ... to fill in the values.

So I'm not seeing any way to realize a procedure that follows your choice 1.

Nevertheless, copying the (column) data just to overwrite it in a second step would be suboptimal in any case. Altering a table to add new columns does minimal I/O. So even if there were a way to execute your choice 1, choice 2 promises better performance by factors.

Thus, two statements will achieve what you want: one ALTER TABLE adding all your new columns in one go, then one UPDATE providing the new values for these columns.
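
A sketch of those two statements (table, column, and function names are placeholders for your actual ones):

ALTER TABLE tbl
    ADD COLUMN modified_color text
  , ADD COLUMN modified_size  text;  -- etc. for all five columns

UPDATE tbl
SET    modified_color = my_modify(initial_color)
     , modified_size  = my_modify(initial_size);  -- etc.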

rpy
  • The table should be the same, so you are right: option 1 doesn't apply. The problem with option 2 is that I need to provide explicitly the data type, collation, etc. I should not need to do that if there were a COPY COLUMN clause in ALTER TABLE. In that case they could also optimize the copy of the data at the physical level, because the operation is well defined. Another option is to allow UPDATE ... SET ... to create the new column if it doesn't exist. – yucer Apr 09 '16 at 14:03
  • Yes, but explicitly stating all column attributes should really only be an issue if you do not know the table structure beforehand, e.g. if you are given a table name, the information to form a new column along column 3 of that table, and a function to call to get the resulting value. If you know it at programming time, it is maybe a bit annoying but not a true problem. For the rest, this is probably a feature request for the Postgres people or the SQL standardisation committee. – rpy Apr 09 '16 at 15:41
  • Well, I recognize it's weird to consider this at programming time, given that DDL statements are mostly used from management tools. But sometimes there is software that dynamically alters the structure of its database. This is my case: the column name is a parameter. Anyway, the need to respecify the column for this operation is like data redundancy; it's prone to the error of mistaking column creation parameters, even when using management tools. – yucer Apr 10 '16 at 13:42
  • True, there is no easy support for dynamically referring to DB metadata. You will need to use what the specific database provides. With pg you will need to read the pg_* catalogs, retrieve the information details, and form the ALTER statements from that information. On the other hand, modifying schemas is not considered a standard dynamic operation to be applied to arbitrary databases at any time. Changing a schema is expected to be a planned operation that does not happen very often. – rpy Apr 10 '16 at 19:14

Create the new column (modified_color); it will have a value of NULL on all records.

Then run an update statement, assuming your table name is 'table':

update table
set modified_color = 'blue_green'
where initial_color = 'blue'

If I am correct, this can also work like this:

update table set modified_color = 'blue_green' where initial_color = 'blue';
update table set modified_color = 'red_orange' where initial_color = 'red';
update table set modified_color = 'yellow_brown' where initial_color = 'yellow';
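
If the set of colors is fixed, the three statements can also be collapsed into a single update with a CASE expression, so the table is rewritten only once (a sketch, assuming the same table and values as above):

update table
set modified_color = case initial_color
                         when 'blue'   then 'blue_green'
                         when 'red'    then 'red_orange'
                         when 'yellow' then 'yellow_brown'
                     end;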

Once you have done this, you can do another update (assuming you have another column that I will call modified_color1):

update table set modified_color1 = modified_color;
max
  • I am not sure if this works on other DBMSs, but PostgreSQL says something like: unknown column "modified_color". The new column must be created first. – yucer Apr 09 '16 at 13:48