2

I am trying to build a basic report structure, and I have an id in table1, that is basically a user_id, that then any time a user does something or is associated with something, that ID is in other tables capturing those actions.

I would like to get for all the table1 records for a day, how records in these other tables are associated with these. (example for this user, he has 10 widgets, 15 sign_ins, 20 generic_actions)

Here's my query that gives the correct result, but SUPER inefficient returns about 6 million rows (when not distinct) and should only be a few thousand.

Results I'm getting says basically if I do this for one user and get the results above I get 1*10*15*20 rows returned, when I really want 1 row, with how many of each. Yes I know I can count distinct, but its still not joining right and searching so many rows that its prohibitive. Is there a join type or something I am missing to efficiently join without all the extra rows?

SELECT 
 DISTINCT DATE_TRUNC('day',table1.created_at) as c_date,
 count(distinct table1.id) as t1_tot,
 count(distinct table2.id) as t2_tot,
 count(distinct table3.id) as t3_tot,
 count(distinct table4.id) as t4_tot,
 count(distinct table5.id) as t5_tot,
FROM 
 table1 
LEFT JOIN 
 table2 ON table1.id = table2.t1_id 
LEFT JOIN 
table3 ON table1.id = table3.t1_id 
LEFT JOIN 
 table4 ON table1.id = table4.t1_id 
LEFT JOIN 
 table5 ON table1.id = table5.t1_id  
WHERE 
(table1.created_at >= '02-02-2015' AND table1.created_at <= '02-05-2015') 
GROUP BY c_date  
ORDER BY c_date desc

Is there actually a way to get what I want with a join? The query is so expensive it times out.

I am using postgres and rails, all of these are associated with models and associations.

UPDATE Testing Andrews comment got query plan

Unique  (cost=1062356.31..1062384.74 rows=2843 width=16)
   ->  Sort  (cost=1062356.31..1062363.42 rows=2843 width=16)
     Sort Key: (date_trunc('day'::text, table1.created_at)), (count(table1.id)), (count(table1_1.id))
     ->  HashAggregate  (cost=1062157.68..1062193.22 rows=2843 width=16)
           ->  Merge Right Join  (cost=0.58..1062136.35 rows=2845 width=16)
                 Merge Cond: (table1_1.id = table1.id)
                 ->  GroupAggregate  (cost=0.29..1059054.94 rows=41399 width=4)
                       ->  Nested Loop  (cost=0.29..756842.24 rows=60359742 width=4)
                             ->  Index Only Scan using table1_pkey on table1 table1_1  (cost=0.29..2314.24 rows=41399 width=4)
                             ->  Materialize  (cost=0.00..34.87 rows=1458 width=0)
                                   ->  Seq Scan on table2  (cost=0.00..27.58 rows=1458 width=0)
                 ->  Index Scan using table1_pkey on table1  (cost=0.29..2521.24 rows=2845 width=12)
                       Filter: ((created_at >= '2015-02-02 00:00:00'::timestamp without time zone) AND (created_at <= '2015-02-05 00:00:00'::timestamp without time zone))
LCIII
  • 3,102
  • 3
  • 26
  • 43
Teylewr
  • 33
  • 4
  • All you need from the other tables is the count? If so I'd suggest not joining the tables on ID but on a query such as... LEFT JOIN (SELECT t1_id , COUNT(t1_id) FROM table2 GROUP BY t1_id) AS newtable2 ON table1.id = newtable2.t1_id Not 100% sure that I wrote that out correctly but that is the basic idea. – Andrew Whatever Feb 06 '15 at 18:21
  • Yes, if t2..t5 are not related, just don't join them altogether. Use 'UNION ALL' if really need to have one query. – Matt Feb 06 '15 at 18:33
  • Andrew, just trying this with one of the tables, it was extremely slow and gave me matching results for each column select DISTINCT DATE_TRUNC('day’,table1.created_at) as a_date, count(table1.id) as t1_tot,count(newtable2.t1_id) as t2_tot FROM table1 LEFT JOIN (SELECT table1.id as t1_id, COUNT(table1.id) FROM table2,table1 GROUP BY table1.id) AS newtable2 ON table1.id = newtable2.t1_id WHERE (table1.created_at >= '02-02-2015' AND table1.created_at <= '02-05-2015' GROUP BY c_date ORDER BY c_date desc; – Teylewr Feb 06 '15 at 19:13
  • user4419802 I am using postgres inside of rails. What would a union all look like as an example? Every time i try and union these i get postgress errors. – Teylewr Feb 06 '15 at 19:14
  • I'm assuming all of the necessary columns are indexed correctly? I do a lot of similar joins on our tables, many of which are tables that hold 5-10 million records, and they run pretty quickly? – Andrew Whatever Feb 06 '15 at 19:17
  • i have indexes on created_at and t1_id on all tables. Theres only like 40k rows in table1 and like 1500 in table2. Adding this query plan above – Teylewr Feb 06 '15 at 19:20

2 Answers2

0

Results I'm getting says basically if I do this for one user and get the results above i get 1*10*15*20 rows returned, when i really want 1 row, with how many of each.

If you're just looking for count of ids of each table then there is no need to join them. This query may work for you:

SELECT 
 DISTINCT DATE_TRUNC('day',table1.created_at) as c_date,
 SUM(CASE WHEN TableId = 'Table1' THEN IdCount ELSE 0 END) AS t1_tot,
 SUM(CASE WHEN TableId = 'Table2' THEN IdCount ELSE 0 END) AS t2_tot,
 SUM(CASE WHEN TableId = 'Table3' THEN IdCount ELSE 0 END) AS t3_tot,
 SUM(CASE WHEN TableId = 'Table4' THEN IdCount ELSE 0 END) AS t4_tot,
 SUM(CASE WHEN TableId = 'Table5' THEN IdCount ELSE 0 END) AS t5_tot
FROM 
(
SELECT COUNT(id) AS 'IdCount', 'Table1' AS TableId FROM table1 UNION ALL
SELECT COUNT(id) AS 'IdCount', 'Table2' AS TableId FROM table2 UNION ALL
SELECT COUNT(id) AS 'IdCount', 'Table3' AS TableId FROM table3 UNION ALL
SELECT COUNT(id) AS 'IdCount', 'Table4' AS TableId FROM table4 UNION ALL
SELECT COUNT(id) AS 'IdCount', 'Table5' AS TableId FROM table5) AS innerTable
)
WHERE 
(table1.created_at >= '02-02-2015' AND table1.created_at <= '02-05-2015') 
GROUP BY c_date  
ORDER BY c_date desc

I haven't tested it, but it should give you the idea of what to do, and since you're not joining everything it should run way faster. The innerTable basically returns the count of ids in each table as a separate row for each table, where the outer select transposes it into 1 row, which is what you seem like you want.

Getting the row counts is easy; most of the work is in transposing.

Community
  • 1
  • 1
LCIII
  • 3,102
  • 3
  • 26
  • 43
  • I only want the counts of the other table ids when they have an existing record, so if table1.id = 1 i want the count of table2 records that include table1_id = 1. It could return multiple rows, if i do say 5 days, i need 5 rows, one for each day, which gets table1 id counts, table2 counts etc for each individual day. It feels like something simple, that I could be overcomplicating. – Teylewr Feb 06 '15 at 21:52
0

You should break up your query into several subqueries. That is, instead of

SELECT SUM1, SUM2, SUM3, SUM4 FROM (A Join B Join C Join D)

it should look like

SELECT SQ1.SUM1+SQ2.SUM1+SQ3.SUM1, SQ1.SUM2, SQ2.SUM3, SQ3.SUM4 FROM
(SELECT SUM1, SUM2 FROM A JOIN B) SQ1 CROSS JOIN
(SELECT SUM1, SUM3 FROM A JOIN C) SQ2 CROSS JOIN
(SELECT SUM1, SUM4 FROM A JOIN D) SQ3

Note, that it will be long and ugly, yet really fast, because all subqueries return just one row.

Matt
  • 13,674
  • 1
  • 18
  • 27