0

This table is the (fake example) result of a big messy query.

=> WITH data(run, host, status, field_ip, control_ip, field_data, control_data) AS (
    VALUES (1, 1, 'no_control', '10.0.0.1', NULL, 'foo', NULL),
           (1, 2, 'good', '10.0.0.1', '10.0.0.1', 'bar', 'bar'),
           (1, 3, 'problem_1', '10.0.0.1', NULL, 'bar', NULL),
           (1, 3, 'problem_2', '10.0.0.2', NULL, 'baz', NULL),
           (1, 3, NULL, NULL, '192.168.1.1', NULL, 'wallace'))
SELECT * FROM data;
 run | host |   status   | field_ip | control_ip  | field_data | control_data 
-----+------+------------+----------+-------------+------------+--------------
   1 |    1 | no_control | 10.0.0.1 |             | foo        | 
   1 |    2 | good       | 10.0.0.1 | 10.0.0.1    | bar        | bar
   1 |    3 | problem_1  | 10.0.0.1 |             | bar        | 
   1 |    3 | problem_2  | 10.0.0.2 |             | baz        | 
   1 |    3 |            |          | 192.168.1.1 |            | wallace

I want to merge the "control_ip" and "control_data" fields from rows where they are NOT NULL onto rows where they are NULL, according to the following rules: Consider each group of rows with the same 'run' and 'host' values independently. Within each such group:

  1. Each row where all the fields are NOT NULL is to be output unchanged and removed from the group. (For example, the row (1,2) is to be output unchanged. It's the only row in its group, so that group is now done.)
  2. For each row where the 'field_ip' and 'field_data' fields are NOT NULL, but the 'control_ip' and 'control_data' fields are NULL, find another row in the group where the opposite is true, and copy its 'control_ip' and 'control_data' fields onto the first row. Then output the modified row and remove it from the group.
    • If 'another row where the opposite is true' does not exist, output the first row unchanged (for example, row (1,1) is to be output unchanged).
    • If there is more than one 'another row where the opposite is true', pick one arbitrarily, and discard it after copying its values. (No example of this in the sample data set.)
    • If there is only one 'another row where the opposite is true', do not discard it after copying its values. (For example, row (1,3,NULL,NULL,'192.168.1.1',NULL,'wallace') should have its control_ip/control_Data values copied onto both the (1,3,problem_1,10.0.0.1) and (1,3,problem_2,10.0.0.2) rows.)
  3. Repeat step 2 until the group has no more rows where the 'field_ip' and 'field_data' fields are NOT NULL.
  4. Discard all remaining rows in the group (which will all have NULL field_ip and field_data).

The desired output for the above sample data is:

 run | host |   status   | field_ip | control_ip  | field_data | control_data 
-----+------+------------+----------+-------------+------------+--------------
   1 |    1 | no_control | 10.0.0.1 |             | foo        | 
   1 |    2 | good       | 10.0.0.1 | 10.0.0.1    | bar        | bar
   1 |    3 | problem_1  | 10.0.0.1 | 192.168.1.1 | bar        | wallace
   1 |    3 | problem_2  | 10.0.0.2 | 192.168.1.1 | baz        | wallace

This is a similar problem to How to concatenate text from multiple rows into a single text string in SQL server?, but that answer won't work for me because GROUP BY run, host plus aggregates can only emit a single row per group, whereas I need in some cases to emit multiple rows per group. I tried some things involving GROUP BY run, host, field_ip but then the third (1,3) row gets treated as its own group, which is no good. Also, I am using PostgreSQL (12), not SQL Server, and AIUI PIVOT is a sql-server-ism.

Note: the last step of the big messy query, producing the table shown, is the result of a FULL JOIN on run, host, and field_ip = control_ip. If it would be easier to do what I am asking for by starting from the precursors to the FULL JOIN, please tell me.

What do you suggest?

zwol
  • 135,547
  • 38
  • 252
  • 361
  • What rules allow you to know that would want to discard the third (1,3) row rather than apply the 'vertical coalesce' to fill in it's `NULL` values? Similarly, for the rows where you are filling in the `NULL` values, can you **guarantee** that there will only ever be ***one*** `NOT NULL` value to choose from? Finally, will it only ever be the `control_` columns that you need to do this for? – MatBailie Apr 30 '21 at 18:23
  • @MatBailie I have added more explicit rules for what I want to the question. – zwol Apr 30 '21 at 18:54
  • I've updated my answer, and expanded the sample data by one row to demonstrate some of the behaviour. By the way, do you want a job? I can ask because this comment is ephemeral. – MatBailie Apr 30 '21 at 19:06

2 Answers2

0

Hmmm . . . the rules are not 100% clear, but this does conform to what you want to do:

select t.*
from (select t.*,
             max(field_ip) over (partition by run, host) as imputed_field_ip,
             count(*) over (partition by run, host) as cnt
      from t
     ) t
where cnt = 1 or field_ip is null;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Using MAX() OVER () can be used as a 'kind of vertical coalesce', on the assumption that there is only one NOT NULL value to pick.

I've added one line of example data below to show how I deal with there potentially being multiple values to pick from;

  • I only pick from rows where the corresponding field_ column is NULL.

Finally I exclude records where the field_ columns are both NULL.

WITH
  data(run, host, status, field_ip, control_ip, field_data, control_data)
AS
(
  VALUES (1, 1, 'no_control', '10.0.0.1', NULL, 'foo', NULL),
         (1, 2, 'good', '10.0.0.1', '10.0.0.1', 'bar', 'bar'),
         (1, 3, 'problem_0', '10.0.0.1', '192.168.2.1', 'bar', 'zaphod'),
         (1, 3, 'problem_1', '10.0.0.1', NULL, 'bar', NULL),
         (1, 3, 'problem_2', '10.0.0.2', NULL, 'baz', NULL),
         (1, 3, NULL, NULL, '192.168.1.1', NULL, 'wallace')
),
  replace_nulls AS
(
  SELECT
    run,
    host,
    status,
    field_ip,
    COALESCE(
      control_ip,
      MAX(CASE WHEN field_ip IS NULL THEN control_ip END) OVER (PARTITION BY run, host)
    )
      AS control_ip,
    field_data,
    COALESCE(
      control_data,
      MAX(CASE WHEN field_data IS NULL THEN control_data END) OVER (PARTITION BY run, host)
    )
      AS control_data
  FROM
    data
)
SELECT
  replace_nulls.*
FROM
  replace_nulls
WHERE
  field_ip IS NOT NULL
  OR
  field_data IS NOT NULL
;

Demo : https://dbfiddle.uk/?rdbms=postgres_13&fiddle=204569771c9affb372b971e1e8740b80

MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • Here's a slightly more generalised example : https://dbfiddle.uk/?rdbms=postgres_13&fiddle=e140af96616894b81d25ea3cd643a1da – MatBailie Apr 30 '21 at 19:45