
I have a Postgres 9.1.4 table with a few million rows. A very small version could be this:

master

index  location
----------------
1      A
2      C
3      B
4      C
5      C
6      A

I need to do an expensive calculation on each distinct value in the location field. I don't want to use this field from the master table because I would be repeatedly processing the same location. I want a table with distinct values of location, and the result of the calculation will be stored in a calculation field:

distinct

index  location  calculation
------------------------------
1      A'        X
2      C'        Y
3      B'        Z

The relationship between master and distinct is not easily determined after distinct is populated. I have to do a little data manipulation to make the locations work in the calculation. I really need a third table, created roughly at the same time I populate distinct, to help me correlate each entry in distinct back to its parent in master.

The third table might look like this:

correlation

master_index  distinct_index
------------------------------
1             1
2             2
3             3
4             2
5             2
6             1

The problem is I don't see how this would be done with any straightforward SQL. I could use something like this as the beginning of a query to populate distinct:

SELECT location, array_agg(index)
FROM master
GROUP BY location;

The problem is that I would need another array column in distinct to hold these values, and then I'd need some other program to parse the arrays and construct the correlation table.
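
For illustration, here is roughly the flow I have in mind (just a sketch; the intermediate table name is a placeholder):

-- sketch of the array-based approach; distinct_loc is a placeholder name
CREATE TABLE distinct_loc AS
SELECT location, array_agg(index) AS master_indexes
FROM   master
GROUP  BY location;
-- ...then some external program would have to read master_indexes and emit
-- one (master_index, distinct_index) row per array element to build correlation.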

Am I missing a simpler way of doing this?

Aren Cambre
  • Doesn't your location already reference back to the master table? Why the additional table? – rfusca Jul 16 '12 at 17:46
  • Not really. The **location** in **distinct** has prime notation because it's not identical to the **location** in **master**. Only around the time I create the **distinct** table am I guaranteed to be able to make that correlation. – Aren Cambre Jul 18 '12 at 02:50

5 Answers


You can create the "distinct" table as follows (although I caution you to avoid using SQL keywords as names of tables or columns):

create table TDistinct as
    select m.location, min(m.index) as TDIndex, <whatever> as calculation
    from master m
    group by m.location;

Create the correlation table as:

create table correlation as
    select m.index as MasterIndex, td.TDIndex
    from master m
    join TDistinct td on m.location = td.location;

These work, but you might want something more efficient. After creating the tables, add indexes for performance. You can also do other tricks, such as pre-creating the distinct table with an auto-incremented primary key and using that as the distinct index; then you would use INSERT ... SELECT to load the data into the table.
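
A rough sketch of that variant (column types and index names here are assumptions, and it replaces the CREATE TABLE ... AS approach above):

-- pre-create the distinct table with an auto-incremented primary key
create table TDistinct (
    TDIndex     serial primary key,
    location    text not null,
    calculation text               -- filled in later by the expensive calculation
);

-- load one row per distinct location
insert into TDistinct (location)
select location
from master
group by location;

-- indexes to speed up joining back to master
create index master_location_idx on master (location);
create unique index tdistinct_location_idx on TDistinct (location);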

Gordon Linoff

What you describe is not an n:m but a 1:n relationship. As such you don't need the correlation table at all to implement the relationship.

It is suboptimal design to have the location spelled out redundantly in a table with "a few million rows". You should normalize it. The big table will shrink a lot (making everything faster). Replace the column master.location with a foreign key column that references the new primary key of the dedicated location table. That's the way it should be.

Table master:

master_id  loc_id  ...
----------------------
1          1       ...
2          2
3          3
4          2
5          2
6          1

Table location:

loc_id  loc  loc_derived calc
-----------------------------
1       A    A'          X
2       C    C'          Y
3       B    B'          Z

An expensive calculation seems to warrant redundant storage of the functionally dependent value in calc. But your question says nothing about how the column loc_derived is computed. If it isn't very expensive (and it is used on a regular basis), you shouldn't include it in the location table; create a view or a function (generated column) for it instead.
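
For example, a minimal sketch of that view alternative (some_calc() is the same placeholder function used in step 2 below; the view name is made up):

-- derive loc_derived on the fly instead of storing it redundantly
CREATE VIEW location_v AS
SELECT loc_id, loc, some_calc(loc) AS loc_derived, calc
FROM   location;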

Normally you would then add a foreign key constraint between master and location to enforce referential integrity. Going forward, you would add a location before you add a row in master that references it (see the sketch after the steps below).

How to get there?

  1. Create the location table with a surrogate integer primary key (loc_id).

    CREATE TABLE location (
     loc_id serial PRIMARY KEY
    ,loc text
    ,loc_derived  text  -- do you really need redundant storage here?
    ,calc text          -- seems you need redundant storage here.
    );
    

    There are those who would advise using loc itself as a natural primary key. I am not among them; I consider this a bad idea:

    • String manipulation and indexes on text columns are considerably slower than handling integers.
    • You make it hard to rename locations if that would (temporarily) create duplicate location names.
    • Every change to a location name triggers additional updates in dependent tables - that doesn't happen with a surrogate key.
    • The big master table will be much smaller with just an integer instead of the whole string to reference the location table.
  2. Fill new location table (I would use a CTE for calculations).

    WITH x AS (
        SELECT DISTINCT loc
        FROM master
        )
    INSERT INTO location (loc, loc_derived, calc)
    SELECT loc, some_calc(loc), some_expensive_calc(loc)
    FROM   x;
    
  3. Add loc_id to the master table.

    ALTER TABLE master ADD COLUMN loc_id integer;
    
  4. Fill the column loc_id.

    UPDATE master m
    SET    loc_id = l.loc_id
    FROM   location l
    WHERE  l.loc = m.loc;
    
  5. Drop column master.loc.

    ALTER TABLE master DROP COLUMN loc;
    
  6. VACUUM / ANALYZE.

    VACUUM ANALYZE; 
    -- VACUUM FULL ANALYZE
    -- much slower, only to shrink the table and return disk space.
    
  7. Add fk constraint.

    ALTER TABLE master
    ADD CONSTRAINT master_loc_id_fkey FOREIGN KEY (loc_id)
    REFERENCES location(loc_id);
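
Once migrated, day-to-day use looks roughly like this (only a sketch; the literal values and the returned loc_id are made up, and some_calc() / some_expensive_calc() are the placeholders from step 2):

-- going forward: add the location (and its calculation) first ...
INSERT INTO location (loc, loc_derived, calc)
VALUES ('D', some_calc('D'), some_expensive_calc('D'))
RETURNING loc_id;

-- ... then reference it from master (assuming the returned loc_id was 4)
INSERT INTO master (master_id, loc_id) VALUES (7, 4);

-- and the per-row calculation is a simple join away
SELECT m.master_id, l.loc, l.calc
FROM   master m
JOIN   location l USING (loc_id);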
    
Erwin Brandstetter
  • Thanks. This is a very good comment. This dataset is really a disposable dataset for a large research project. It will be loaded once, research done on it, then disposed. Because of that, I'm not going too far with normalization. – Aren Cambre Jul 18 '12 at 02:52

You can do this with just the master and distinct tables. Your distinct list (even if from multiple fields) is really the basis for your foreign key. The distinct table's unique values would have a one-to-many relationship back to the master table. An index on the master table's location data and a unique index on the distinct table's location data will help with performance if joining them back together.
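
A minimal sketch of those two indexes (assuming the table and column names from the question, and quoting "distinct" since it is a reserved word):

CREATE INDEX master_location_idx ON master (location);
CREATE UNIQUE INDEX distinct_location_idx ON "distinct" (location);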

Matt S

I have done this in SQL Server. I think this is what you are looking for.

You could create the distinct table using the query

select MIN("index") "index",location from tmaster
group by location 

and the correlation table by the query below

select M."index" master_index,A."index" distinct_index
 from tmaster M left outer join 
(select MIN("index") "index",location from tmaster
group by location)A
on A.location=M.location
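
If you want to materialize these as actual tables, a sketch (CREATE TABLE ... AS is the Postgres spelling; on SQL Server the equivalent is SELECT ... INTO):

create table tdistinct as
select min("index") as "index", location
from tmaster
group by location;

create table correlation as
select M."index" as master_index, A."index" as distinct_index
from tmaster M
left outer join tdistinct A on A.location = M.location;
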
Joe G Joseph

You don't need to put correlation into a table; a view or CTE will suffice. (BTW: index is a reserved word, so I replaced it with zindex - there are no SQL keywords starting with Z. Same for zdistinct.)

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp;
SET search_path=tmp;

-- Make some data ...
CREATE TABLE master
        ( zindex INTEGER NOT NULL PRIMARY KEY
        , location CHAR(1) NOT NULL
        );
INSERT INTO master(zindex,location) VALUES
 (1, 'A')
,(2, 'C')
,(3, 'B')
,(4, 'C')
,(5, 'C')
,(6, 'A')
        ;

-- a view with a CTE inside
CREATE VIEW correlation AS (
        WITH zdistinct AS (
                SELECT MIN(m.zindex) AS zindex
                , m.location AS location
                FROM master m
                GROUP BY m.location
                )
        SELECT m.zindex AS master_index
                , d.zindex AS distinct_index
        FROM master m
        JOIN zdistinct d ON m.location = d.location
        );

SELECT * FROM correlation;

And, by the way: the zdistinct CTE is about the same as the "distinct" table in the original question. Just add the "calculation" field and you're home. (You could put it into a separate view first; see the sketch below.)
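
For instance, a sketch of that "separate view first" idea (some_calculation() is only a placeholder for the real computation):

CREATE VIEW zdistinct AS
SELECT MIN(m.zindex)                AS zindex
     , m.location                   AS location
     , some_calculation(m.location) AS calculation
FROM master m
GROUP BY m.location;

CREATE VIEW correlation2 AS        -- same as the view above, but built on zdistinct
SELECT m.zindex AS master_index
     , d.zindex AS distinct_index
FROM master m
JOIN zdistinct d ON m.location = d.location;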

wildplasser
  • Thank you. This unfortunately won't work because the **zdistinct** table has a **calculation** column. That value is a complex calculation that involves a web service, so it'll have to be run after the **zdistinct** table is created. – Aren Cambre Jul 18 '12 at 02:15