
I am querying a table containing billions of rows (BIG_TABLE) in Oracle. The table is partitioned on a monthly basis (DTE is the month of flux). The DBA wants us to filter our queries by month of flux, and warned that (i) not doing so may cause slower queries and (ii) it may impact other users, so he may kill non-filtered queries.

Since I need several months (sometimes up to 100), I use a parameterized query to pass the months I need (more below). I collect the results for all months in an intermediate table (INTERMEDIATE_TABLE, about 200 million rows per month) and then aggregate the data into FINAL_TABLE, which is used for the analyses. The sum must be done by CHR regardless of the month of flux.

-- query1
CREATE TABLE
  INTERMEDIATE_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
);

-- query2
INSERT INTO
  INTERMEDIATE_TABLE
SELECT
  CHR,
  NBR,
  DTE
FROM
  BIG_TABLE
WHERE
  DTE = TO_DATE(?, 'YYYY-MM-DD');

-- query3
CREATE TABLE
  FINAL_TABLE AS
SELECT
  CHR,
  SUM(NBR) AS NBR
FROM
  INTERMEDIATE_TABLE
GROUP BY
  CHR;

After saving the queries as strings (query1, query2, query3), I use R's DBI package to run them. dbExecute() binds the dates one by one to the equality comparison in query2, so the query is executed three times (i.e. the vector is not passed in one go).

library(DBI)
con <- dbConnect(odbc::odbc(), ...)  # connection details elided
dbExecute(con, query1)               # create INTERMEDIATE_TABLE
dbExecute(con, query2, params = list(c("2020-01-01", "2020-02-01", "2020-03-01")))  # executed once per month
dbExecute(con, query3)               # create FINAL_TABLE

I would like INTERMEDIATE_TABLE to be temporary. Can I INSERT INTO a CTE?

I am open to alternative solutions, but I am looking for an ANSI solution because a move to PostgreSQL is planned. I would like to avoid a temporary table because, in Oracle, only the data is temporary, not the table itself. Of course I could just DROP the table afterwards, but it seems more elegant/explicit (and more efficient?) to make it temporary in the first place.
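
For reference, this is the Oracle-specific global temporary table I would like to avoid (a sketch mirroring INTERMEDIATE_TABLE above): the rows are scoped to the session, but the table definition itself stays in the schema.

-- Oracle global temporary table: the definition is permanent, the data is not
CREATE GLOBAL TEMPORARY TABLE
  INTERMEDIATE_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
) ON COMMIT PRESERVE ROWS; -- rows are kept until the session ends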


A reproducible example:

CREATE TABLE
  BIG_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
);

INSERT ALL
  INTO BIG_TABLE VALUES ('A', 2, DATE '2020-01-01')
  INTO BIG_TABLE VALUES ('B', 3, DATE '2020-01-01')
  INTO BIG_TABLE VALUES ('A', 1, DATE '2020-02-01')
  INTO BIG_TABLE VALUES ('B', 2, DATE '2020-02-01')
  INTO BIG_TABLE VALUES ('A', 3, DATE '2020-02-01')
  INTO BIG_TABLE VALUES ('B', 2, DATE '2020-03-01')
  INTO BIG_TABLE VALUES ('B', 4, DATE '2020-03-01')
  INTO BIG_TABLE VALUES ('C', 1, DATE '2020-03-01')
  INTO BIG_TABLE VALUES ('B', 4, DATE '2020-04-01')
  INTO BIG_TABLE VALUES ('D', 2, DATE '2020-05-01')
SELECT 1 FROM DUAL;

Desired output:

CHR NBR
  A   6
  B  11
  C   1
Thomas

3 Answers


You can use a subquery in place of your intermediate table, so that everything runs in one query.

CREATE TABLE
  FINAL_TABLE AS
SELECT
  CHR,
  SUM(NBR) AS NBR
FROM
  (SELECT
     CHR,
     NBR,
     DTE
   FROM
     BIG_TABLE
   WHERE
     DTE = TO_DATE(?, 'YYYY-MM-DD')) INTERMEDIATE_TABLE
GROUP BY
  CHR;

As a CTE:

CREATE TABLE FINAL_TABLE
AS
WITH INTERMEDIATE_TABLE AS (
  SELECT
    CHR,
    NBR,
    DTE
  FROM
    BIG_TABLE
  WHERE
    DTE = TO_DATE(?, 'YYYY-MM-DD')
)
SELECT
  CHR,
  SUM(NBR) AS NBR
FROM
  INTERMEDIATE_TABLE
GROUP BY
  CHR;
nbk
  • Thank you very much! However, after testing, I get an `ORA-01036` error. Maybe it is related to [this question](https://stackoverflow.com/a/21376164/)? – Thomas Mar 19 '23 at 23:05
  • Also, I find CTEs more readable than subqueries. Is it possible to turn this into a CTE? Thanks! – Thomas Mar 19 '23 at 23:08
  • A CTE isn't more readable, but I added it anyway. I wonder if your query 2 works as intended; with only one date it works. – nbk Mar 19 '23 at 23:21
  • 1
    To CTAS, you have to alias anything not taken straight from another column. So that would mean SUM(NBR) needs an alias. But see my answer, subqueries, CTEs, intermediate tables are all meaningless for this use. – Paul W Mar 19 '23 at 23:30
  • 1
    yeah you are right, you need an alias, but this is only a short example so your approach will not work with the real query, my question would be can he really pass a list of dates to a equal comparison – nbk Mar 19 '23 at 23:33
  • No, he should not pass a list of dates in a single exec call. @Thomas, is that what you're trying to do? Your example didn't show it... you have dte = ?. Were you thinking of plugging in multiple values there? – Paul W Mar 19 '23 at 23:35
  • he writes *dbExecute(con, query2, params = list('2020-01-01', '2020-02-01', '2020-03-01'))* – nbk Mar 19 '23 at 23:46
  • I'm not familiar with that programming env. I was paying attention to the actual Oracle statements. We'll see what he says. – Paul W Mar 19 '23 at 23:51
  • To answer your comments: `query2` works because `dbExecute()` passes the dates one by one and not as a list to the equality comparison (so in my example query2 is executed 3 times). However, we need `DTE = TO_DATE(?, 'YYYY-MM-DD')` instead of `DTE = ?` (I have updated my question accordingly; could you please also edit your answer?). Finally, [bind parameters do not seem to be allowed in `CREATE` statements in Oracle](https://stackoverflow.com/a/21376164/11148823). Your solution gives an ORA-01036 error with the `CREATE` statement but works without it. – Thomas Mar 20 '23 at 11:03
  • 1
    @Thomas, just a note - a cte is a subquery. – jarlh Mar 20 '23 at 12:39
  • 1
    @Thomas i uodated the answer with your code change for the date comparison – nbk Mar 20 '23 at 13:48

You don't need any kind of intermediate stage at all, not even a subquery.

CREATE TABLE FINAL_TABLE AS
SELECT CHR,
       SUM(NBR) NBR
  FROM BIG_TABLE
 WHERE DTE = ?
 GROUP BY CHR;

Of course, with a billion rows, you want parallelism, which you may or may not be getting depending on how the table is configured. It might help to explicitly request it:

CREATE TABLE FINAL_TABLE PARALLEL (DEGREE 16) NOLOGGING  AS
SELECT CHR,
       SUM(NBR) NBR
  FROM BIG_TABLE
 WHERE DTE = ?
 GROUP BY CHR;

If you need multiple months, you have several options. I suggest a single query in a manner that maximizes parallel thread distribution by partition, if this is a partitioned table:

CREATE TABLE final_table PARALLEL (DEGREE 16) NOLOGGING AS
SELECT chr,
       SUM(nbr) nbr
  FROM (SELECT chr,
               SUM(nbr) nbr
          FROM big_table
         WHERE dte = ?
         GROUP BY chr
        UNION ALL
        SELECT chr,
               SUM(nbr) nbr
          FROM big_table
         WHERE dte = ?
         GROUP BY chr
        UNION ALL
        SELECT chr,
               SUM(nbr) nbr
          FROM big_table
         WHERE dte = ?
         GROUP BY chr)
  GROUP BY chr;

If you need to pull many months and have them all end up in this table while only pulling one month at a time, then precreate the table and insert each month:

CREATE TABLE final_table(dte date,chr varchar2(18),nbr integer); -- one time

INSERT /*+ append */ into final_table
SELECT /*+ parallel(16) */
       dte,
       chr,
       SUM(nbr) nbr
  FROM big_table
 WHERE dte = ?
 GROUP BY dte, 
          chr;

COMMIT;

-- repeat for every month. 
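
In the R/DBI setup from the question, that monthly loop could look roughly like this (a sketch: con and the vector of months are assumed, and the /*+ append */ direct-path hint is dropped because consecutive direct-path inserts in one transaction require a commit in between):

library(DBI)

insert_query <- "
INSERT INTO final_table
SELECT /*+ parallel(16) */
       dte,
       chr,
       SUM(nbr) nbr
  FROM big_table
 WHERE dte = TO_DATE(?, 'YYYY-MM-DD')
 GROUP BY dte,
          chr"

months <- c("2020-01-01", "2020-02-01", "2020-03-01")  # up to ~100 months

# dbExecute() binds and runs the statement once per element of the vector,
# i.e. one INSERT per month; odbc connections autocommit by default.
dbExecute(con, insert_query, params = list(months))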

Then at the end, if you need the whole thing aggregated across time, you can do a GROUP BY that excludes the date:

SELECT chr,
       SUM(nbr) nbr
  FROM final_table
 GROUP BY chr;
Paul W
  • 1
    My answer *is* a parameterized query (uses a bind variable). Maybe you can explain why it doesn't work and I can suggest something. – Paul W Mar 19 '23 at 23:22
  • See my discussion with nbk. The assumption is you are asking for one month at a time. If that's not correct, please advise. – Paul W Mar 19 '23 at 23:52
  • 1
    I edited my answer for how to populate FINAL_TABLE with multiple months while doing what your DBA asked. – Paul W Mar 20 '23 at 00:58
  • 2
    Edited my answer again, see the last part. Although, as a DBA, I would question your DBA's reasoning... if you need 100 months, it isn't a very reasonable restriction to ask you to do one month at a time. With that many, partitioning is no longer a reasonable cause for such a restriction. Perhaps there is insufficient TEMP space for the GROUP BY? If so, with billion row tables around, your TEMP should be resized to an appropriate level to support querying such objects. Our users regularly summarize billions of rows without resorting to creating any tables or doing it in chunks. – Paul W Mar 20 '23 at 02:06
  • Thanks again for your time and advice! (i) Proposals 1 & 2 do not work because the sum must be done by `CHR` across several `DTE`. (ii) Proposal 3 is not applicable because I have about 100 months to query. (iii) Proposal 4 also requires an intermediate table (your `FINAL_TABLE` still needs an aggregation, like my `INTERMEDIATE_TABLE`). In addition, proposals 1 to 3 give an `ORA-01036` error because, if I understood correctly, [bind parameters are not allowed in `CREATE` statements in Oracle](https://stackoverflow.com/a/21376164/11148823) (but it works without the `CREATE` statement). – Thomas Mar 20 '23 at 12:22
  • Also, we need `DTE = TO_DATE(?, 'YYYY-MM-DD')` instead of `DTE = ?` (I have updated my question accordingly, could you please also edit your answer?). – Thomas Mar 20 '23 at 12:27
  • #3 would work even with 100 UNION ALLs... that's what dynamic code is for; it just takes a simple loop to build it. #4 would work also - there is no reason to aggregate as you go. You'd have to do a merge to insert-or-update each set, but that's far more work and less efficient than just inserts, and you simply aggregate at query time. Lastly, on the bind variable for DDL: don't use binds, simply string-concatenate the entire DDL with the date as part of the string (see the sketch after these comments). That's okay for DDL because DDL is supposed to happen rarely, not in bulk volume. – Paul W Mar 20 '23 at 12:29
  • 1
    DTE = TO_DATE(?, 'YYYY-MM-DD') is only appropriate if you are binding a string. Normally applications are expected to bind the correct data type. If you bind a real date, you do not use TO_DATE around it. – Paul W Mar 20 '23 at 12:29
  • 2
    I think you should go back to your DBA and ask for the reason for the one-month-at-a-time constraint. You are getting complex and creating difficulties for yourself without any clear reason for doing so. – Paul W Mar 20 '23 at 12:30
  • Thank you so much, Paul, for your patience and advice. I had been blindly following the one-month-at-a-time constraint since I started using the DB. For the first time I just filtered to what I need in one go, and the query ran faster, as you suggested! I will ask my DBA for sure. – Thomas Mar 20 '23 at 19:22
  • I have finally accepted your answer because of your advice, in your last comment, to drop the one-month-at-a-time constraint. I was not expecting such advice when asking for "alternative solutions". This is so much easier! Ideally, could you please update your answer to include this comment? – Thomas Mar 25 '23 at 11:12
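
A sketch of the dynamic-DDL approach suggested in the comments above (hypothetical: month and con are placeholders; the date is concatenated into the statement because Oracle DDL does not accept bind parameters):

library(DBI)

month <- "2020-01-01"  # hypothetical month to aggregate

# Build the entire CTAS statement as one string instead of binding;
# the ANSI DATE literal avoids TO_DATE and also works in PostgreSQL.
ctas <- sprintf(
  "CREATE TABLE FINAL_TABLE AS
   SELECT CHR, SUM(NBR) AS NBR
     FROM BIG_TABLE
    WHERE DTE = DATE '%s'
    GROUP BY CHR", month)

dbExecute(con, ctas)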

Consider using materialized views (MVs) instead of temporary tables. This gives the same effect - you're precomputing and storing the results in a table - but with more built-in controls to help you refresh the data.

For example, create this MV:

create materialized view summaries as
  select chr, trunc ( dte, 'mm' ) mth, sum ( nbr ) tot
  from   big_table
  group  by chr, trunc ( dte, 'mm' );

This will store one row per chr per month. Unless there are millions of values of chr each month, this should be enough to avoid the requirement to "filter by month", as you'll be reading substantially less data each time.

If you make the MV fast refresh on commit, then the database will keep it up-to-date for you automatically.
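
A sketch of what that could look like (assumptions to verify for your Oracle version: fast-refreshable aggregate MVs need a materialized view log on the base table plus extra COUNT columns, and DBMS_MVIEW.EXPLAIN_MVIEW reports what a given MV supports):

create materialized view log on big_table
  with sequence, rowid ( chr, dte, nbr )
  including new values;

create materialized view summaries
  refresh fast on commit
as
  select chr, trunc ( dte, 'mm' ) mth,
         sum ( nbr ) tot,
         count ( nbr ) cnt_nbr, -- needed for fast refresh of SUM
         count (*) cnt_all      -- needed for fast refresh of aggregates
  from   big_table
  group  by chr, trunc ( dte, 'mm' );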

Even if you're forced to go down the "filter by month" approach, you could still build this into the MV.

Create a filtering table to store the months you want, insert the values before each run, and refresh the MV:

create table filter_table (
  month_value date
);

create materialized view summaries as
  select chr, trunc ( dte, 'mm' ) mth, sum ( nbr ) tot
  from   big_table bt
  join   filter_table ft
  on     trunc ( dte, 'mm' ) = ft.month_value
  group  by chr, trunc ( dte, 'mm' );
  
insert into filter_table values ( date '2020-01-01' );
commit;

exec dbms_mview.refresh ( 'summaries' );

select * from summaries;

CHR    MTH                   TOT
B      01-JAN-2020 00:00       3
A      01-JAN-2020 00:00       2
Chris Saxon