
This is similar to this post, but different enough that I'm asking again. The proposed solutions would also be impractically slow in my case.

Note: I am also working with really large datasets in Redshift.

I am trying to calculate "scores" for a statistical model, which requires that I calculate a dot product for every row in a data set. In particular, I would like to create a new column in my table "SCORE" that multiplies the values for each column in the first table by a pre-specified value for that column.

For example, I have the following tables:

INITIAL

+--------+------+-------+
|  NAME  | COL0 | COL1  |
+--------+------+-------+
| AL     | 1    | 4     |
| BILL   | 2    | 5     |
| CATHY  | 3    | 6     |
+--------+------+-------+

ESTIMATES

+--------+----------+
|  NAME  | ESTIMATE |
+--------+----------+
| COL0   | 5        |
| COL1   | 10       |
+--------+----------+

and I want a final table,

FINAL

+--------+------+-------+-------+
|  NAME  | COL0 | COL1  | SCORE |
+--------+------+-------+-------+
| AL     | 1    | 4     |  45   |
| BILL   | 2    | 5     |  60   |
| CATHY  | 3    | 6     |  75   |
+--------+------+-------+-------+

For the final table, the value in each column of the INITIAL table is multiplied by the corresponding value listed in the ESTIMATES table, and the products are summed. For example, the SCORE for AL is 1 * 5 + 4 * 10 = 45, where the 5 and 10 come from the ESTIMATES table.

Currently, I am creating the SCORE column by writing SQL that hard-codes the numbers from the ESTIMATES table.

d_a_c321
    You need to be prepared for "impractically slow" no matter how it's done if your datasets are truly large and you have many columns. – user2338816 Jun 27 '14 at 23:52

4 Answers


You can do this by joining the tables and then using conditional aggregation:

select i.name,
       max(i.col0) as col0,
       max(i.col1) as col1,
       (max(i.col0) * max(case when e.name = 'COL0' then estimate end) +
        max(i.col1) * max(case when e.name = 'COL1' then estimate end)
       ) as score
from initial i cross join
     estimates e
group by i.name;
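Not part of the original answer, but the cross-join/conditional-aggregation approach is easy to sanity-check on a miniature copy of the question's tables using SQLite from Python (table names and values taken from the question's example):

```python
import sqlite3

# Build the question's example tables in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE initial (name TEXT, col0 INTEGER, col1 INTEGER);
    INSERT INTO initial VALUES ('AL', 1, 4), ('BILL', 2, 5), ('CATHY', 3, 6);
    CREATE TABLE estimates (name TEXT, estimate INTEGER);
    INSERT INTO estimates VALUES ('COL0', 5), ('COL1', 10);
""")

# Cross join, then use conditional aggregation to pick each column's estimate.
rows = con.execute("""
    SELECT i.name,
           MAX(i.col0) AS col0,
           MAX(i.col1) AS col1,
           (MAX(i.col0) * MAX(CASE WHEN e.name = 'COL0' THEN estimate END) +
            MAX(i.col1) * MAX(CASE WHEN e.name = 'COL1' THEN estimate END)) AS score
    FROM initial i CROSS JOIN estimates e
    GROUP BY i.name
    ORDER BY i.name
""").fetchall()
print(rows)
```

This reproduces the question's FINAL table, including SCORE = 45, 60, 75.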
Gordon Linoff

Assuming the ESTIMATES table always pivots down to a single row via CASE...

SELECT i.name, i.col0, i.col1, (pest.mcol0 * i.col0 + pest.mcol1 * i.col1) AS score
FROM Initial i
CROSS JOIN
  (SELECT
     MAX(CASE name WHEN 'COL0' THEN estimate END) AS mcol0,
     MAX(CASE name WHEN 'COL1' THEN estimate END) AS mcol1
   FROM estimates) pest -- pivot estimates into one row
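A note on why this variant can be cheaper on large tables: ESTIMATES is collapsed to a single row once, and only that one row is joined to INITIAL, so no GROUP BY over the big table is needed. A small SQLite check of the same pivot-then-join shape (demo data copied from the question):

```python
import sqlite3

# Recreate the question's example tables in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE initial (name TEXT, col0 INTEGER, col1 INTEGER);
    INSERT INTO initial VALUES ('AL', 1, 4), ('BILL', 2, 5), ('CATHY', 3, 6);
    CREATE TABLE estimates (name TEXT, estimate INTEGER);
    INSERT INTO estimates VALUES ('COL0', 5), ('COL1', 10);
""")

# Pivot estimates into one row first, then cross join it to every initial row.
rows = con.execute("""
    SELECT i.name, i.col0, i.col1,
           pest.mcol0 * i.col0 + pest.mcol1 * i.col1 AS score
    FROM initial i
    CROSS JOIN (SELECT MAX(CASE name WHEN 'COL0' THEN estimate END) AS mcol0,
                       MAX(CASE name WHEN 'COL1' THEN estimate END) AS mcol1
                FROM estimates) pest
    ORDER BY i.name
""").fetchall()
print(rows)
```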
xQbert

You can do that by joining the tables as:

SELECT i.NAME, i.COL0, i.COL1,
       (MAX(i.COL0) * MAX(CASE WHEN e.name = 'COL0' THEN estimate END) +
        MAX(i.COL1) * MAX(CASE WHEN e.name = 'COL1' THEN estimate END)) AS SCORE
FROM INITIAL i
CROSS JOIN ESTIMATES e
GROUP BY i.NAME, i.COL0, i.COL1
ORDER BY i.NAME;

Please refer to this SQLFiddle link for reference.

09Q71A0548

For future-comers, here is a PostgreSQL implementation of a vector dot product:

CREATE OR REPLACE FUNCTION public.dot_product(IN vector1 double precision[], IN vector2 double precision[])
    RETURNS double precision
AS $BODY$
BEGIN
    -- Multiply the arrays element-wise with unnest(), then sum the products.
    RETURN (SELECT sum(mul)
            FROM (SELECT v1e * v2e AS mul
                  FROM unnest(vector1, vector2) AS t(v1e, v2e)) AS products);
END;
$BODY$ LANGUAGE plpgsql;
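For readers who want to verify the expected behavior outside the database, the same multiply-and-sum is a one-liner in Python (a minimal sketch, not part of the original answer). Using AL's row from the question, `[1, 4] · [5, 10]` should give 45:

```python
def dot_product(vector1, vector2):
    # Sum of element-wise products, mirroring the plpgsql function above.
    return sum(a * b for a, b in zip(vector1, vector2))

print(dot_product([1.0, 4.0], [5.0, 10.0]))  # 45.0
```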
Shadi