Collapsing multiple rows containing substrings into a single row

Question

I have several rows of records (id, query,count) that I would like to collapse by observing the query on multiple rows. I want to retain the row with the longest query and the sum of the count field for the rows that are collapsed.

Sample input:

24, que, 2
24, querie, 1
24, queries, 1
25, term1, 3
25, term1+term2, 11
25, term1+term2+term3, 1
26, inventory, 5
26, issues, 10
27, close, 1
27, sclosed, 2
28, abcde, 2
28, abcfe, 2

Required output:

24, queries, 4
25, term1+term2+term3, 15
26, inventory, 5
26, issues, 10
27, close, 1
27, sclosed, 2
28, abcde, 2
28, abcfe, 2

I'm only after a special case of substrings: 24,25 is collapsed but 27 is not, due to the prefix on close. 26 is also not collapsed as the query field in the second row is not a substring (not prefixed) by the first.

Edit: Added id 28 which is another case where the records should not be collapsed.

S-Man · Accepted Answer · 2018-09-22T18:01:28.847

demo:db<>fiddle

For more general cases (e.g. "the difference could be at the 10th letter" or "there is a row with just one char") you need to identify the right groups. So it would be necessary to check rows against the next one: "Is the current row a beginning of the next one?"

With some things like substring you can check a special length at the beginning ("Group all texts beginning with the same 3 letters" But what if you do not have 3 letters? Or the difference is somewhere later?)

That's why I calculated special groups with the help of the lag window function (https://www.postgresql.org/docs/current/static/tutorial-window.html):

SELECT 
    max(id) as id,                                        -- C
    max(phrase) as phrase,
    sum("count") as count
FROM (
    SELECT 
        *,
        SUM(is_diff) OVER (ORDER BY id, phrase) as ranked -- B
    FROM (
        SELECT
            *,
            -- A: 
            CASE WHEN phrase LIKE (lag(phrase) over (order by id, phrase)) || '%' THEN 0 ELSE 1 END as is_diff
        FROM phrases 
    )s
) s
GROUP BY ranked
ORDER BY ranked

The main idea is discussed here.

A: The lag function allows to check the value of the next row. So if the phrase of the current row the beginning of the phrase of the next row then they are in the same group. (current_row LIKE (next_row || '%')). This works because the id groups are ordered by phrase texts (and their lengths).

If the rows are not compatible a helper variable is set to 1, otherwise to 0.

B: The helper variables can be added and the groups are generated. (For more details see the link provided above).

C: The rest is simple grouping by the new generated group values.

This is exactly what I'm after, great catch for the added case, of `abcde` and `abcfe`. Will add that to my question too. Didn't realise those general cases, thanks! — kami, Sep 21 '18 at 21:41

Fahmi · Answer 2 · 2018-09-21T07:29:53.887

0

Use aggregation for max and sum and group by column should be id and substring(column2,from 1 for 3) as you are following some pattern for total value

select id, max(column2), sum(column3)
from tablename
group by id, substring(column2,from 1 for 3)

edited Sep 21 '18 at 07:29

answered Sep 21 '18 at 07:06

Fahmi

37,315
5
22
31

ScaisEdge · Answer 3 · 2018-09-21T07:19:45.073

0

If you need agreagtio for id then you could use sum for get the total and max for obtain the name

    select  id, max(col2), sum(col3)
    from my_table 
    group by  id

if you need expectio for aggregation by id and for obtain 27 collapsed you could use

    select  id, max(col2), sum(col3)
    from my_table 
    where id <>27
    group by  id 
    union  all
    select  id, col2, sum(col3)
    from my_table 
    where id =27
    group by  id, cold2
    order by id

edited Sep 21 '18 at 07:19

answered Sep 21 '18 at 07:10

ScaisEdge

131,976
10
91
107

Great trick, but record 26,27 will be collapsed with the max of the two strings. – kami Sep 21 '18 at 07:14
@kami There is a logic for this or is just an execpion without motivation ?.. try expalin the logic for obtain this – ScaisEdge Sep 21 '18 at 07:17
Majority of the cases in the table are in the form 26, 27 for which I don't need a collapse. The query is a search term and 24, 25 are query suggestions. I want to collapse them and store only the longest query as this is the only one I'm interested in. – kami Sep 21 '18 at 07:27

Pavel Stehule · Answer 4 · 2018-09-21T08:28:58.520

This task cannot be solved with aggregates - because in your case, because grouping subsets are defined by equality, and you want to use substring operation.

So you need to implement own set returning function:

create table foo(id int, query text, count int);

CREATE OR REPLACE FUNCTION public.reduce()
 RETURNS SETOF foo
 LANGUAGE plpgsql
AS $function$
declare r foo; sr foo;
begin
  for r in select * from foo order by id, query
  loop
    if sr.id is null then
      sr := r;
    else
      if sr.id = r.id then
        if r.query like  sr.query || '%'
           or sr.query like r.query || '%' 
        then
          if length(r.query) > length(sr.query) then
            sr.query := trim(r.query);
          end if;
          sr.count := sr.count + r.count; 
        else
          return next sr;
          sr := r;
        end if;
      else
        return next sr;
        sr = r;
      end if;
    end if;
  end loop;
  if sr.id is not null then
    return next sr;
  end if;
end;
$function$

postgres=# select * from reduce();
+----+-------------------+-------+
| id |       query       | count |
+----+-------------------+-------+
| 24 | queries           |     4 |
| 25 | term1+term2+term3 |    15 |
| 26 | inventory         |     5 |
| 26 | issues            |    10 |
| 27 | close             |     1 |
| 27 | sclosed           |     2 |
+----+-------------------+-------+
(6 rows)

The string operations can be slow, but this task is pretty unrelational and is not possible do it with only SQL.

score -1 · Answer 5 · answered Sep 21 '18 at 07:38

-1

select  id, max(col2), sum(col3)
    from tablename
    where id in (24,25)
    group by  id 
    union  all
    select  id, col2, sum(col3)
    from tablename
    where id not in (24,25)
    group by  id,col3
    order by id,3;

answered Sep 21 '18 at 07:38

theDbGuy

903
1
9
22

Collapsing multiple rows containing substrings into a single row

5 Answers5