Group by running total resetting every time the counter is incremented

Question

I have the below table (my input would be just columns NIF, FILE and FILESIZE).

I would like to obtain columns PARTITION and SUBPARTITION with the following rules:

PARTITION should start with 1 and be incremented by 1 every tune SUBPARTITION reaches 8.
A single NIF cannot be in two different PARTITIONs. (This is the most restrictive condition actually).
SUBPARTITION must be incremented by 1 every time the running total in FILESIZE reaches 10. That running total must be reset when SUBPARTITION is incremented by 1.

Data:

  NIF  FILE FILESIZE    PARTITION   SUBPARTITION
  ----------------------------------------------
   A     C1        1            1              1
   A     C2        1            1              1
   A     C3        2            1              1
   A     C4        1            1              1
   B     C5        5            1              2
   B     C6        1            1              2
   C     C7        2            1              2
   C     C8        1            1              2
   D     C9        4            1              3
   D    C10        5            1              3
   D    C11        1            1              3
   D    C12        2            1              4
   D    C13        3            1              4
   D    C14        4            1              4
   D    C15        5            1              5
   E    C16        3            1              6
   E    C17        2            1              6
   E    C18        3            1              6
   E    C19        4            1              7
   F    C20        6            2              1
   F    C20        2            2              1

It does not need to be a single query, doing a PSLQL cursor would also be fine.

NOTE: this is the code to create the input table.

DROP TABLE my_table;
-- Create the table
CREATE TABLE my_table (
  NIF VARCHAR2(10),
  FILE_ VARCHAR2(10),
  FILESIZE NUMBER
);

-- Insert the input data
INSERT INTO my_table (NIF, FILE_, FILESIZE)
    SELECT 'A', 'C1', 1 FROM DUAL UNION ALL
    SELECT 'A', 'C2', 1 FROM DUAL UNION ALL
    SELECT 'A', 'C3', 2 FROM DUAL UNION ALL
    SELECT 'A', 'C4', 1 FROM DUAL UNION ALL
    SELECT 'B', 'C5', 5 FROM DUAL UNION ALL
    SELECT 'B', 'C6', 1 FROM DUAL UNION ALL
    SELECT 'C', 'C7', 2 FROM DUAL UNION ALL
    SELECT 'C', 'C8', 1 FROM DUAL UNION ALL
    SELECT 'D', 'C9', 4 FROM DUAL UNION ALL
    SELECT 'D', 'C10', 5 FROM DUAL UNION ALL
    SELECT 'D', 'C11', 1 FROM DUAL UNION ALL
    SELECT 'D', 'C12', 2 FROM DUAL UNION ALL
    SELECT 'D', 'C13', 3 FROM DUAL UNION ALL
    SELECT 'D', 'C14', 4 FROM DUAL UNION ALL
    SELECT 'D', 'C15', 5 FROM DUAL UNION ALL
    SELECT 'E', 'C16', 3 FROM DUAL UNION ALL
    SELECT 'E', 'C17', 2 FROM DUAL UNION ALL
    SELECT 'E', 'C18', 3 FROM DUAL UNION ALL
    SELECT 'E', 'C19', 4 FROM DUAL UNION ALL
    SELECT 'F', 'C20', 6 FROM DUAL UNION ALL
    SELECT 'F', 'C21', 2 FROM DUAL;
    
COMMIT;

Not sure it is possible generally. Consider 11 NIF='E' entries with FILESIZE=9. Rule #3 requires 11 subpartitions must be created, rule #1 says you need at least 2 partitions for that, and rule #2 requires that all subs must be in a single partition. — Serg, Jul 13 '23 at 10:39

MT0 · Accepted Answer · 2023-07-13T13:00:18.277

From Oracle 12, you can use a combination of analytic functions and MATCH_RECOGNIZE to perform row-by-row processing:

SELECT cto,
       nif,
       file_,
       partition,
       mn - first_mn + 1 AS subpartition,
       total_size
FROM   (
  SELECT t.*,
         ROW_NUMBER() OVER (
           PARTITION BY cto
           ORDER BY SUBSTR(file_, 1, 1), TO_NUMBER(SUBSTR(file_, 2))
         ) AS rn,
         SUM(filesize) OVER (PARTITION BY nif) AS nif_size
  FROM   my_table t
)
MATCH_RECOGNIZE(
  PARTITION BY cto
  ORDER BY rn
  MEASURES
    MATCH_NUMBER() AS mn,
    SUM(filesize) AS total_size
  ALL ROWS PER MATCH
  PATTERN ( (same_nif* nif)* same_nif* end_nif )
  DEFINE
    same_nif AS nif = NEXT(nif),
    nif      AS  (NEXT(nif) IS NULL OR nif != NEXT(nif))
             AND SUM(filesize) + NEXT(nif_size) <= 10,
    end_nif  AS  SUM(filesize) <= 10
)
MATCH_RECOGNIZE(
  PARTITION BY cto
  ORDER BY rn
  MEASURES
    MATCH_NUMBER() AS partition,
    FIRST(mn) AS first_mn
  ALL ROWS PER MATCH
  PATTERN ( (same_nif* nif)+ )
  DEFINE
    same_nif AS nif = NEXT(nif),
    nif      AS (NEXT(nif) IS NULL OR nif != NEXT(nif))
             AND mn - FIRST(mn) < 7
);

In first match, for the sub-partitions, the pattern:

same_nif* nif will match from the current row to the end of the current NIF partition and ensure that the next NIF partition will also entirely be contained in the match by checking the current size plus the total size of the next NIF partition is within the limits;
Wrapping it in ()* allows that to be matched zero-or-more times (as many as is possible);
The final same_nif* end_nif allows the match to partially match a set of rows with the same nif values (such as the files C9, C10 and C11 which has the same nif as C12, C13 and C14) but would be in a different group as they file size totals more than the limit. Without a different end condition you could end up either grouping C1 through C5 or not being able to split the D nif values.

In the second match, for the partitions, the nif values need to be kept together in the same match (whereas for sub-partitions nif values can span multiple matches) and, apart from that, the only check needed is that there are not more than 7 sub-partitions in a group.

Which, for the sample data (which has an added CTO column and an additional nif group G that is too large to be combined in the same partition as the previous nif group F):

CREATE TABLE my_table (
  CTO NUMBER,
  NIF VARCHAR2(10),
  FILE_ VARCHAR2(10),
  FILESIZE NUMBER
);

INSERT INTO my_table (CTO, NIF, FILE_, FILESIZE)
    SELECT 1, 'A', 'C1', 1 FROM DUAL UNION ALL
    SELECT 1, 'A', 'C2', 1 FROM DUAL UNION ALL
    SELECT 1, 'A', 'C3', 2 FROM DUAL UNION ALL
    SELECT 1, 'A', 'C4', 1 FROM DUAL UNION ALL
    SELECT 1, 'B', 'C5', 5 FROM DUAL UNION ALL
    SELECT 1, 'B', 'C6', 1 FROM DUAL UNION ALL
    SELECT 1, 'C', 'C7', 2 FROM DUAL UNION ALL
    SELECT 1, 'C', 'C8', 1 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C9', 4 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C10', 5 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C11', 1 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C12', 2 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C13', 3 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C14', 4 FROM DUAL UNION ALL
    SELECT 1, 'D', 'C15', 5 FROM DUAL UNION ALL
    SELECT 1, 'E', 'C16', 3 FROM DUAL UNION ALL
    SELECT 1, 'E', 'C17', 2 FROM DUAL UNION ALL
    SELECT 1, 'E', 'C18', 3 FROM DUAL UNION ALL
    SELECT 1, 'E', 'C19', 4 FROM DUAL UNION ALL
    SELECT 1, 'F', 'C20', 6 FROM DUAL UNION ALL
    SELECT 1, 'F', 'C21', 2 FROM DUAL UNION ALL
    SELECT 1, 'G', 'C' || (LEVEL + 21), 10 FROM DUAL CONNECT BY LEVEL <= 7;

Outputs:

CTO	NIF	FILE_	PARTITION	SUBPARTITION	TOTAL_SIZE
1	A	C1	1	1	1
1	A	C2	1	1	2
1	A	C3	1	1	4
1	A	C4	1	1	5
1	B	C5	1	2	5
1	B	C6	1	2	6
1	C	C7	1	2	8
1	C	C8	1	2	9
1	D	C9	1	3	4
1	D	C10	1	3	9
1	D	C11	1	3	10
1	D	C12	1	4	2
1	D	C13	1	4	5
1	D	C14	1	4	9
1	D	C15	1	5	5
1	E	C16	1	6	3
1	E	C17	1	6	5
1	E	C18	1	6	8
1	E	C19	1	7	4
1	F	C20	2	1	6
1	F	C21	2	1	8
1	G	C22	3	1	10
1	G	C23	3	2	10
1	G	C24	3	3	10
1	G	C25	3	4	10
1	G	C26	3	5	10
1	G	C27	3	6	10
1	G	C28	3	7	10

fiddle

Sorry because this was not in the original scope of the question but a slight enhancement. If there was another column at the beginning called CTO and I wanted to make the same logic but starting over again when CTO changes, how would I modify the pattern? Also, why have you defined the pattern as ( (same_nif* nif)* same_nif* end_nif ) as opposed to same_nif * nif * end_nif? — Javi Torre, Jul 13 '23 at 10:20
@JaviTorre Updated to implement partitioning by `CTO` and also to fully implement rule #2 (although you may still have issues if you have more than 7 sub-partitions with the same `nif` as it is then impossible to meet rule #2). — MT0, Jul 13 '23 at 12:01
What does the pattern mean though? And why not same_nif * nif * end_nif? — Javi Torre, Jul 13 '23 at 12:41
@JaviTorre The first pattern is for the sub-partitions and needs to be able to split `nif` groups on the `filesize` limit so that successive sub-partitions can have the same `nif` value. The second pattern is for splitting into partitions and the `nif` values need to be kept together. The patterns are different because they have different purposes. — MT0, Jul 13 '23 at 12:57

score 0 · Answer 2 · answered Jul 13 '23 at 09:07

You could create a temporary table or a real yable with the fields of (NIF, FILE, FILESIZE, PARTITION, SUBPARTITION). You could add an id to it as well, but for the purpose of this answer I will describe only the absolute minimum.

You would set PARTITION and SUBPARTITION to be nullable and insert lines via commands like

insert into yourtable(NIF, FILE, FILESIZE)
values('A', 'C1', 1);

and implement a before insert trigger that will find max(PARTITION) to find out what the current partition is as well as the max(SUBPARTITION) of that partition along with the sum(FILESIZE) in that partition. Something like (this is an agnostic query, you will need to save the fields into local variables)

select PARTITION, SUBPARTITION, sum(FILESIZE) storage_use
from yourtable
group by PARTITION, SUBPARTITION
order by PARTITION desc, SUBPARTITION desc
limit 1

From here on, inside your trigger you can see whether storage_use + the actual FILESIZE of the record you are to insert is greater than 10. If not, then you set PARTITION and SUBPARTITION of your new record to those that you found in the query. If not, then you can increment SUBPARTITION as long as it's lower than or equal with 8. If SUBPARTITION is 8 and storage_use does not allow you to fit the new record into it, then store PARTITION + 1 and SUBPARTITION + 1 as PARTITION and SUBPARTITION, respectively.

score 0 · Answer 3 · answered Jul 16 '23 at 11:28

You can also use MODEL clause:

select nif, file_, filesize, partition, subpartition 
from (
    select 
      row_number() over(order by nif, to_number(substr(file_,2))) as rn, 
      nif, file_, to_number(substr(file_,2)) as file_order,
      filesize 
    from my_table
)
model
    dimension by (rn)
    measures( nif as nif, file_ as file_, file_order as file_order, 
        filesize as filesize,
        0 as partition, 0 as incsubpartition, 1 as subpartition, 0 as running_total )
    rules (
        running_total[any] = 
           case when nvl(running_total[cv()-1],0) + filesize[cv()] >= 10 
           then filesize[cv()]
           else nvl(running_total[cv()-1],0) + filesize[cv()]
           end,
        incsubpartition[any] = 
           case when nvl(running_total[cv()-1],0) + filesize[cv()] >= 10 
           then 1
           else 0
           end,
       -- rule 3
       subpartition[any] = nvl(subpartition[cv()-1],1) + incsubpartition[cv()], 
       partition[any] = 
           case when subpartition[cv()] = 8 and subpartition[cv()-1] <> 8 
           then partition[cv()-1] + 1
           else nvl(partition[cv()-1],1)
           end,
       -- rule 1
       subpartition[any] = 
          case when subpartition[cv()] = 8 
          then 1 
          else subpartition[cv()]
          end,
       -- rule 2
       partition[any] =
          case when nif[cv()] = nif[cv()-1] 
          then partition[cv()-1]
          else partition[cv()]
          end
   )
order by nif, file_order
;


NIF FILE_   FILESIZE    PARTITION   SUBPARTITION
A   C1  1   1   1
A   C2  1   1   1
A   C3  2   1   1
A   C4  1   1   1
B   C5  5   1   2
B   C6  1   1   2
C   C7  2   1   2
C   C8  1   1   2
D   C9  4   1   3
D   C10 5   1   3
D   C11 1   1   4
D   C12 2   1   4
D   C13 3   1   4
D   C14 4   1   5
D   C15 5   1   5
E   C16 3   1   6
E   C17 2   1   6
E   C18 3   1   6
E   C19 4   1   7
F   C20 6   2   1
F   C21 2   2   1

CTO	NIF	FILE_	PARTITION	SUBPARTITION	TOTAL_SIZE
1	A	C1	1	1	1
1	A	C2	1	1	2
1	A	C3	1	1	4
1	A	C4	1	1	5
1	B	C5	1	2	5
1	B	C6	1	2	6
1	C	C7	1	2	8
1	C	C8	1	2	9
1	D	C9	1	3	4
1	D	C10	1	3	9
1	D	C11	1	3	10
1	D	C12	1	4	2
1	D	C13	1	4	5
1	D	C14	1	4	9
1	D	C15	1	5	5
1	E	C16	1	6	3
1	E	C17	1	6	5
1	E	C18	1	6	8
1	E	C19	1	7	4
1	F	C20	2	1	6
1	F	C21	2	1	8
1	G	C22	3	1	10
1	G	C23	3	2	10
1	G	C24	3	3	10
1	G	C25	3	4	10
1	G	C26	3	5	10
1	G	C27	3	6	10
1	G	C28	3	7	10

CTO	NIF	FILE_	PARTITION	SUBPARTITION	TOTAL_SIZE
1	A	C1	1	1	1
1	A	C2	1	1	2
1	A	C3	1	1	4
1	A	C4	1	1	5
1	B	C5	1	2	5
1	B	C6	1	2	6
1	C	C7	1	2	8
1	C	C8	1	2	9
1	D	C9	1	3	4
1	D	C10	1	3	9
1	D	C11	1	3	10
1	D	C12	1	4	2
1	D	C13	1	4	5
1	D	C14	1	4	9
1	D	C15	1	5	5
1	E	C16	1	6	3
1	E	C17	1	6	5
1	E	C18	1	6	8
1	E	C19	1	7	4
1	F	C20	2	1	6
1	F	C21	2	1	8
1	G	C22	3	1	10
1	G	C23	3	2	10
1	G	C24	3	3	10
1	G	C25	3	4	10
1	G	C26	3	5	10
1	G	C27	3	6	10
1	G	C28	3	7	10

Group by running total resetting every time the counter is incremented

3 Answers3

CTO	NIF	FILE_	PARTITION	SUBPARTITION	TOTAL_SIZE
1	A	C1	1	1	1
1	A	C2	1	1	2
1	A	C3	1	1	4
1	A	C4	1	1	5
1	B	C5	1	2	5
1	B	C6	1	2	6
1	C	C7	1	2	8
1	C	C8	1	2	9
1	D	C9	1	3	4
1	D	C10	1	3	9
1	D	C11	1	3	10
1	D	C12	1	4	2
1	D	C13	1	4	5
1	D	C14	1	4	9
1	D	C15	1	5	5
1	E	C16	1	6	3
1	E	C17	1	6	5
1	E	C18	1	6	8
1	E	C19	1	7	4
1	F	C20	2	1	6
1	F	C21	2	1	8
1	G	C22	3	1	10
1	G	C23	3	2	10
1	G	C24	3	3	10
1	G	C25	3	4	10
1	G	C26	3	5	10
1	G	C27	3	6	10
1	G	C28	3	7	10