
I'm struggling with aggregations on a JSONB field in a PostgreSQL database. This is probably easier to explain with an example, so I'll create and populate a table called analysis with 2 columns (id and analysis) as follows:

create table analysis (
  id serial primary key,
  analysis jsonb
);

insert into analysis 
  (id, analysis) values
  (1,  '{"category" : "news",    "results" : [1,   2,  3,  4,  5 , 6,  7,  8,  9,  10,  11,  12,  13,  14, null, null]}'),
  (2,  '{"category" : "news",    "results" : [11, 12, 13, 14, 15, 16, 17, 18, 19,  20,  21,  22,  23,  24, null,   26]}'),
  (3,  '{"category" : "news",    "results" : [31, 32, 33, 34, 35, 36, 37, 38, 39,  40,  41,  42,  43,  44,   45,   46]}'),
  (4,  '{"category" : "sport",   "results" : [51, 52, 53, 54, 55, 56, 57, 58, 59,  60,  61,  62,  63,  64,   65,   66]}'),
  (5,  '{"category" : "sport",   "results" : [71, 72, 73, 74, 75, 76, 77, 78, 79,  80,  81,  82,  83,  84,   85,   86]}'),
  (6,  '{"category" : "weather", "results" : [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,  105,  106]}');

As you can see, the analysis JSONB field always contains 2 attributes: category and results. The results attribute will always contain a fixed-length array of size 16. I've used various functions such as jsonb_array_elements, but what I'm trying to do is the following:

  1. Group by analysis->'category'
  2. Average of each array element

What I want is a statement that returns 3 rows grouped by category (i.e. news, sport and weather), each with a fixed-length array of 16 averages. To further complicate things, if there are nulls in the array then we should ignore them (i.e. we are not simply summing and dividing by the number of rows). The result should look something like the following:

 category  | analysis_average
-----------+--------------------------------------------------------------------------------------------------------------
 "news"    | [14.33, 15.33, 16.33, 17.33, 18.33, 19.33, 20.33, 21.33, 22.33, 23.33, 24.33, 25.33, 26.33, 27.33,  45,  36]
 "sport"   | [61,       62,    63,    64,    65,    66,    67,    68,    69,    70,    71,    72,    73,    74,  75,  76]
 "weather" | [91,       92,    93,    94,    95,    96,    97,    98,    99,  100,   101,   102,   103,   104, 105, 106]

NOTE: Notice the 45 and 36 in the last 2 array items of the first row, which illustrates ignoring the nulls.
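As a side note, PostgreSQL's built-in avg already skips NULL inputs, which is exactly the behaviour wanted here. A minimal sketch using the values from the last array position of the news rows (null, 26 and 46):

```sql
-- avg() ignores NULLs: only 26 and 46 are summed and counted,
-- so the result is 36, not 24 (which dividing by all 3 rows would give)
select avg(v) from (values (null::int), (26), (46)) t(v);
-- → 36
```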

I had considered creating a view which exploded the array into 16 columns i.e.

create view analysis_view as
select a.*,
(a.analysis->'results'->>0)::int as result0,
(a.analysis->'results'->>1)::int as result1
/* ... etc for all 16 array entries .. */
from analysis a;

This seems extremely inelegant to me and removes the advantages of using an array in the first place, but I could probably hack something together using that approach.

Any pointers or tips will be most appreciated!

Also performance is really important here so the higher the performance the better!

bobmarksie

3 Answers


This will work for any array length:

select category, array_agg(average order by subscript) as average
from (
    select
        a.analysis->>'category' category,
        subscript,
        avg(v)::numeric(5,2) as average
    from
        analysis a,
        lateral unnest(
            array(select jsonb_array_elements_text(analysis->'results')::int)
        ) with ordinality s(v,subscript)
    group by 1, 2
) s
group by category
;
 category |                                                 average                                                  
----------+----------------------------------------------------------------------------------------------------------
 news     | {14.33,15.33,16.33,17.33,18.33,19.33,20.33,21.33,22.33,23.33,24.33,25.33,26.33,27.33,45.00,36.00}
 sport    | {61.00,62.00,63.00,64.00,65.00,66.00,67.00,68.00,69.00,70.00,71.00,72.00,73.00,74.00,75.00,76.00}
 weather  | {91.00,92.00,93.00,94.00,95.00,96.00,97.00,98.00,99.00,100.00,101.00,102.00,103.00,104.00,105.00,106.00}

table functions - with ordinality

lateral
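For anyone unfamiliar with it, here is a minimal illustration (separate from the query above) of how unnest ... with ordinality pairs each array element with its 1-based position:

```sql
-- each element comes back alongside its subscript, NULLs included,
-- which is what lets avg() group per position and skip the NULLs
select v, subscript
from unnest(array[10, null, 30]) with ordinality s(v, subscript);
--  v  | subscript
-- ----+-----------
--  10 |         1
--     |         2
--  30 |         3
```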

Clodoaldo Neto
  • Works! With `explain analyse` at start it gives *Planning time: 0.266 ms* and *Execution time: 2.582*. – bobmarksie Mar 10 '16 at 14:23
  • I've read this article (http://blog.heapanalytics.com/dont-iterate-over-a-postgres-array-with-a-loop/) and it mentions that using `unnest` is good for performance. How does this perform at scale? – bobmarksie Mar 11 '16 at 14:55
  • @bobmarksie: That article compares `unnest`, which generates a set, with array parsing using subscripts. The latter does not scale as the array is parsed repeatedly. – Clodoaldo Neto Mar 11 '16 at 15:06

Because the array is always of the same length, you can use generate_series instead of typing out the index of every array element yourself. You CROSS JOIN with that generated series so the index is applied to every row, which lets you get the element at position s from each array. Then it is just a matter of aggregating the data with GROUP BY.

The query then becomes:

SELECT category, array_agg(val ORDER BY s) analysis_average
FROM (
  SELECT analysis->'category' category, s, AVG((analysis->'results'->>s)::numeric) val
  FROM analysis
  CROSS JOIN generate_series(0, 15) s
  GROUP BY category,s
) q
GROUP BY category

In this case, 15 is the last index of the array (16 - 1).
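If the arrays might not always hold exactly 16 elements, a hypothetical variant could derive the upper bound per row with jsonb_array_length instead of hard-coding 15 (untested sketch):

```sql
SELECT category, array_agg(val ORDER BY s) analysis_average
FROM (
  SELECT analysis->'category' category, s,
         AVG((analysis->'results'->>s)::numeric) val
  FROM analysis
  -- LATERAL is needed so the series can reference the analysis column
  CROSS JOIN LATERAL
    generate_series(0, jsonb_array_length(analysis->'results') - 1) s
  GROUP BY category, s
) q
GROUP BY category
```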

Mark

It can be done in a more traditional way:

select
  (t.analysis->'category')::varchar,
  array_math_avg(array(select jsonb_array_elements_text(t.analysis->'results')::int))::numeric(9,2)[]
from
  analysis t
group by 1 order by 1;

but we need to do some preparation:

create type t_array_math_agg as(
  c int[],
  a numeric[]
);

create or replace function array_math_sum_f(in t_array_math_agg, in numeric[]) returns t_array_math_agg as $$
declare
  r t_array_math_agg;
  i int;
begin
  if $2 is null then
    return $1;
  end if;
  r := $1;
  for i in array_lower($2,1)..array_upper($2,1) loop
    -- ignore nulls: only add non-null values, and only count them
    if $2[i] is not null then
      r.a[i] := coalesce(r.a[i],0) + $2[i];
      r.c[i] := coalesce(r.c[i],0) + 1;
    end if;
  end loop;
  return r;
end; $$ immutable language plpgsql;

create or replace function array_math_avg_final(in t_array_math_agg) returns numeric[] as $$
declare
  r numeric[];
  i int;
begin
  if array_lower($1.a, 1) is null then
    return null;
  end if;
  for i in array_lower($1.a,1)..array_upper($1.a,1) loop
    r[i] := $1.a[i] / $1.c[i]; 
  end loop;
  return r;
end; $$ immutable language plpgsql;

create aggregate array_math_avg(numeric[]) (
  sfunc=array_math_sum_f,
  finalfunc=array_math_avg_final,
  stype=t_array_math_agg,
  initcond='({},{})'
);
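A quick sanity check of the custom aggregate on plain arrays, independent of the jsonb table (assuming the objects above have been created):

```sql
-- the null in the first array is ignored in both the sum and the
-- count, so the third position averages to 7
select array_math_avg(a)::numeric(9,2)[]
from (values (array[1,3,null]::numeric[]),
             (array[3,5,7]::numeric[])) t(a);
-- → {2.00,4.00,7.00}
```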
Abelisto