
I'm working with a data warehouse, generating reports. As the name would suggest, I have a LOT of data. One of the queries that pulls a LOT of data is taking longer than I'd like (these aren't ad-hoc; the queries run every night and rebuild tables to cache the reports).

I'm looking at optimizing it, but I'm a little limited in what I can do. I have one query that's written along the lines of...

SELECT column1, column2, ... columnN, (subquery1), (subquery2), ... and so on.

The problem is, the subqueries are repeated a fair amount because each one is wrapped in a CASE expression, such as...

SELECT
    column1
    , column2
    , columnN
    , (SELECT
        CASE
            WHEN (subquery1) > 0 AND (subquery2) > 0
            THEN CAST((subquery1) AS decimal) / (subquery2) * 100
            ELSE 0
        END) AS "longWastefulQueryResults"

Our data comes from multiple sources and there are occasional data-entry errors, so the CASE guards against dividing by zero. The problem is that the subqueries are repeated several times even though their values won't change. I'm sure there's a better way to do it...

I'd love something like what you see below, but I get errors about needing sq1 and sq2 in my GROUP BY clause. I'd provide an exact sample, but it would be painfully tedious to go through.

SELECT
    column1
    , column2
    , columnN
    , (subquery1) AS sq1
    , (subquery2) AS sq2
    , (SELECT
        CASE
            WHEN sq1 > 0 AND sq2 > 0
            THEN CAST(sq1 AS decimal) / sq2 * 100
            ELSE 0
        END) AS "lessWastefulQueryResults"

I'm using Postgres 9.3 but haven't been able to get a successful test yet. Is there anything I can do to optimize my query?

John Sly

3 Answers


I'm not sure how good the Postgres optimizer is, so I'm not sure whether optimizing in this way will do any good. (In my opinion, it shouldn't because the DBMS should be taking care of this kind of thing; but it's not at all surprising if it isn't.) OTOH if your current form has you repeating query logic, then you can benefit from doing something different whether or not it helps performance...

You could put the subqueries in WITH clauses up front, and that might help.

  with subquery1 as (select ...)
     , subquery2 as (select ...)
  select ...

This is similar to putting the subqueries in the FROM clause as Allan suggests, but may offer more flexibility if your queries are complex.
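
For instance, if each subquery boils down to a single scalar value, a query along these lines might work (just a sketch; source_a, source_b, and some_table are invented placeholder names). A handy side effect in 9.3 is that each CTE is evaluated only once per query, so the repeated logic goes away:

  with sq1 as (select count(*) as val from source_a)  -- placeholder for your real subquery
     , sq2 as (select count(*) as val from source_b)  -- placeholder for your real subquery
  select t.column1
       , t.column2
       , case
             when sq1.val > 0 and sq2.val > 0
             then cast(sq1.val as decimal) / sq2.val * 100
             else 0
         end as "lessWastefulQueryResults"
  from some_table t
  cross join sq1
  cross join sq2;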

If you have the freedom to create a temp table as Andrew suggests, that too might work but could be a double-edged sword. At this point you're limiting the optimizer's options by insisting that the temp tables be populated first and then used in the way that makes sense to you, which may not always be the way that actually gets the most efficiency. (Again, this comes down to how good the optimizer is... it's often folly to try to outsmart a really good one.) On the other hand, if you do create temp or working tables, you might be able to apply useful indexes or stats (if they contain large datasets) that would further improve downstream steps' performance.
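
If you do go the temp-table route, the index-and-stats step might look something like this (all names invented for illustration):

  create temp table sq1_results as
  select some_key, count(*) as val   -- placeholder for the real subquery logic
  from source_a
  group by some_key;

  create index on sq1_results (some_key);  -- only worthwhile for large result sets
  analyze sq1_results;                     -- give the planner fresh stats

  -- ...then join sq1_results into the main query in place of the repeated subquery.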

It looks like many of your subqueries might return single values. You could put the queries into a procedure and capture those individual values as variables. This is similar to the temp table approach, but doesn't require creation of objects (as you may not be able to do that) and will have less risk of confusing the optimizer by making it worry about a table where there's really just one value.
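
Postgres 9.3 doesn't have stored procedures as such, but a PL/pgSQL function can play the same role. A rough sketch under the assumption that the subqueries return single values (function, table, and column names are all invented):

  create or replace function rebuild_report_cache() returns void as $$
  declare
      v_sq1 numeric;
      v_sq2 numeric;
  begin
      -- each expensive subquery is evaluated exactly once
      select count(*) into v_sq1 from source_a;  -- placeholder logic
      select count(*) into v_sq2 from source_b;  -- placeholder logic

      insert into report_cache
      select column1
           , column2
           , case
                 when v_sq1 > 0 and v_sq2 > 0
                 then cast(v_sq1 as decimal) / v_sq2 * 100
                 else 0
             end
      from some_table;
  end;
  $$ language plpgsql;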

Mark Adelsberger

Yup, you can create a temp table to store your results and query it again in the same session.
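
For example (names invented; the temp table is dropped automatically when the session ends):

  create temp table sq_cache as
  select count(*) as val from source_a;  -- the expensive subquery runs only once

  select val from sq_cache;              -- reuse it as often as needed in this session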

Andrew Diamond
  • In many cases, a temporary view will be more performant than a temporary table because you don't have to materialize the data when it is created. – Jay K Oct 12 '18 at 20:21

Subqueries in the column list tend to be a questionable design. The first approach I'd take to solving this is to see if you can move them down into the FROM clause.

In addition to letting you use the result of those queries in multiple columns, doing this often helps the optimizer come up with a better plan for your query, because subqueries in the column list have to be executed once for every row rather than merged into the rest of the result set.

Since you only included a portion of the query in your question, I can't demonstrate this particularly well, but what you should be looking for would look more like:

SELECT column1,
       column2,
       columnn,
       subquery1.sq1,
       subquery2.sq2,
       CASE
          WHEN subquery1.sq1 > 0 AND subquery2.sq2 > 0 THEN
             CAST (subquery1.sq1 AS DECIMAL) / subquery2.sq2 * 100
          ELSE
             0
       END
          AS "lessWastefulQueryResults"
FROM   some_table
       JOIN (SELECT   some_columns,
                      COUNT(*) AS sq1    -- stand-in for the real aggregate
             FROM     other_table
             GROUP BY some_columns) subquery1
          ON some_table.some_columns = subquery1.some_columns
       JOIN (SELECT   more_columns,
                      COUNT(*) AS sq2    -- stand-in for the real aggregate
             FROM     yet_another_table
             GROUP BY more_columns) subquery2
          ON some_table.more_columns = subquery2.more_columns
Allan
  • I'm attempting to rewrite this to something more like your example, but I'm getting a Cartesian product and I'm not sure why. For the sake of the example code, pretend column1 and column2 are keys and there's a GROUP BY on both of them. On my join, I join ON subquery1.col1 = some_table.column1 AND subquery1.col2 = some_table.column2. The subquery should be fetching one matching record at a time. I'm not sure what the issue is yet. – John Sly Jan 17 '17 at 21:20
  • In order to get further assistance, we will really need the full query. My guess is that you will need to add some columns to the subqueries in order to get the proper columns to join on, but it's all theoretical without a more concrete sample. – Allan Jan 17 '17 at 21:41