
I am trying to move bytea data from one table to another, updating references in a single query.

To do that, I would like the INSERT's RETURNING clause to return data from the source SELECT that is not itself inserted.

INSERT INTO file_data (data)
  select image from task_log where image is not null
RETURNING id as file_data_id, task_log.id as task_log_id

But I get an error for that query:

[42P01] ERROR: missing FROM-clause entry for table "task_log"

I want to do something like:

WITH inserted AS (
  INSERT INTO file_data (data)
    SELECT image FROM task_log WHERE image IS NOT NULL
  RETURNING id AS file_data_id, task_log.id AS task_log_id
)
UPDATE task_log
SET    task_log.attachment_id = inserted.file_data_id,
       task_log.attachment_type = 'INLINE_IMAGE'
FROM   inserted
WHERE  inserted.task_log_id = task_log.id;

But I fail to get all the data used for the insert: I can't return the id from the subselect in the RETURNING clause.

I was inspired by this answer on how to do that with Common Table Expressions, but I can't find a way to make it work.

  • remove the table-alias for the *L-value* in UPDATE statements: `SET task_log.attachment_id = ...` -->> `SET attachment_id = ...` , etc – joop Nov 09 '17 at 13:12
  • So do you have your answer? – Erwin Brandstetter Nov 10 '17 at 03:50
  • @ErwinBrandstetter Yes! Thanks for your help :-) I posted my solution as an answer as I resorted to add a temp column to keep the duplicates. If you wish you (or I) can add it to your answer to keep one single answer with all those possible solutions. – flob Nov 10 '17 at 09:26
  • Your separate answer is just fine. And I would not want this course of action in my answer anyway. Performance is much worse than for the method using a serial number I proposed in the linked answer. Adding & removing a column requires owner's privileges, a full table rewrite and exclusive locks on the table, which is poison for concurrent access. :) – Erwin Brandstetter Nov 10 '17 at 13:57
  • Hehe, ok, I added that as a warning to my answer, citing your words. Hope that is Ok :-) – flob Nov 10 '17 at 17:20
  • That's perfect. :) – Erwin Brandstetter Nov 11 '17 at 02:03

2 Answers


You need to get your table names and aliases right. Plus, the connection between the two tables is the column image (data in the new table file_data):

WITH inserted AS (
  INSERT INTO file_data (data)
  SELECT image
  FROM   task_log
  WHERE  image IS NOT NULL
  RETURNING id, data  -- can only reference target row
)
UPDATE task_log t
SET    attachment_id = i.id
     , attachment_type = 'INLINE_IMAGE'
FROM   inserted i
WHERE  t.image = i.data;

As explained in the old answer you referenced, image must be unique in task_log for this to work:

I added a technique for disambiguating non-unique values to the referenced answer. Not sure if you'd want duplicate images in file_data, though.

In the RETURNING clause of an INSERT you can only reference columns from the inserted row. The manual:

The optional RETURNING clause causes INSERT to compute and return value(s) based on *each row actually inserted* (...) However, any expression using the table's columns is allowed.

Bold emphasis mine.
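A minimal sketch of that restriction, using the tables from the question (`octet_length()` here just stands in for any expression over the inserted row):

```sql
-- Allowed: RETURNING may use any expression over the inserted row
INSERT INTO file_data (data)
SELECT image
FROM   task_log
WHERE  image IS NOT NULL
RETURNING id, octet_length(data) AS data_size;

-- Not allowed: tables from the source SELECT are out of scope here
-- RETURNING id, task_log.id
-- ERROR: missing FROM-clause entry for table "task_log"
```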

Fold duplicate source values

If you want distinct entries in the target table of the INSERT (file_data), all you need in this case is DISTINCT in the initial SELECT:

WITH inserted AS (
  INSERT INTO file_data (data)
  SELECT DISTINCT image  -- fold duplicates
  FROM   task_log
  WHERE  image IS NOT NULL
  RETURNING id, data  -- can only reference target row
)
UPDATE task_log t
SET    attachment_id = i.id
     , attachment_type = 'INLINE_IMAGE'
FROM   inserted i
WHERE  t.image = i.data;

The resulting file_data.id is used multiple times in task_log. Be aware that multiple rows in task_log now point to the same image in file_data. Careful with updates and deletes ...
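For instance, a quick sanity check (a sketch against the tables above) listing the shared images after the update:

```sql
-- file_data rows referenced by more than one task_log row:
-- their image is now stored once and shared
SELECT attachment_id, count(*) AS referencing_rows
FROM   task_log
WHERE  attachment_type = 'INLINE_IMAGE'
GROUP  BY attachment_id
HAVING count(*) > 1;
```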

Erwin Brandstetter
  • So it's not possible to piggy back the task_log.id onto the result from the insert? – flob Nov 09 '17 at 13:55
  • Not out of the box. I added more. – Erwin Brandstetter Nov 09 '17 at 14:02
  • The images are not unique but I am trying to reference them individually.. I tried your other example with row_number, but I don't know what you eat for breakfast to get your brain up to that insight into postgres :-) – flob Nov 09 '17 at 14:03
  • If I use your updated query I think I should add a `select **distinct** image from task_log` to avoid adding unused entries? – flob Nov 09 '17 at 14:04
  • @flob: Yes, to fold dupes, `DISTINCT` (or `GROUP BY`) in the initial `SELECT` is all you need. Be aware of implications, though. I added more. – Erwin Brandstetter Nov 09 '17 at 14:10
  • Great :-) Thanks a lot! – flob Nov 09 '17 at 14:10
  • I am currently trying to avoid those complications by adding a temp column and removing it in the same transaction. I don't know what the implications might be for a big db, but ours is quite small (less than 20k images). `alter table file_data add column task_log_id bigint;` before, adding that id in the insert `INSERT INTO file_data (data, task_log_id) select image, id from task_log where image is not null` and then using it in the update `where inserted.task_log_id = task_log.id;`. It works but has quite a bad smell ... :-/ – flob Nov 09 '17 at 14:14
  • @flob: Start by getting the requirements straight: Do you want to allow duplicate values of `data` in `file_data`? I.e.: do you want to store images redundantly if used in multiple rows in `task_log`? Typically, the answer is *no*, in which case you have to fold dupes. Plus you might need to do more to avoid dupes between newly inserted and existing rows in `file_data`. See: https://stackoverflow.com/a/30878260/939860 – Erwin Brandstetter Nov 09 '17 at 14:20
  • For now I need those duplicates due to the application but later I should add deduplication :-) Thanks! – flob Nov 09 '17 at 14:40

I needed to replicate duplicates, so I ended up adding a temp column for the id of the source row used for the insert.

alter table file_data add column task_log_id bigint;
-- insert & update data
alter table file_data drop column task_log_id;

The full move script was:

-- A new table for any file data
CREATE TABLE file_data (
  id    BIGSERIAL PRIMARY KEY,
  data  bytea
);

-- Move image data from task_log to file_data

-- Create new columns to reference file_data
alter table task_log add column attachment_type VARCHAR(50);
alter table task_log add column attachment_id bigint REFERENCES file_data;

-- add a temp column for the task_id used for the insert
alter table file_data add column task_log_id bigint;

-- insert data into file_data and set references
with inserted as (
  INSERT INTO file_data (data, task_log_id)
    select image, id from task_log where image is not null
  RETURNING id, task_log_id
)
UPDATE task_log
SET   attachment_id = inserted.id,
      attachment_type = 'INLINE_IMAGE'
FROM  inserted
where inserted.task_log_id = task_log.id;
-- delete the temp column
alter table file_data drop column task_log_id;
-- delete task_log images
alter table task_log drop column image;

As this produces some dead data, I ran a VACUUM FULL afterwards to clean up.
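That cleanup step, for reference (note that VACUUM FULL, like the ALTER TABLE statements above, takes an exclusive lock and rewrites the table):

```sql
-- Reclaim the space left by the dropped columns and dead tuples
VACUUM FULL file_data;
VACUUM FULL task_log;
```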

But please let me repeat the warning from @ErwinBrandstetter:

Performance is much worse than for the method using a serial number I proposed in the linked answer. Adding & removing a column requires owner's privileges, a full table rewrite and exclusive locks on the table, which is poison for concurrent access.

flob