
I have a custom aggregate sum function which accepts the boolean data type:

create or replace function badd (bigint, boolean)
  returns bigint as
$body$
select $1 + case when $2 then 1 else 0 end;
$body$ language sql;

create aggregate sum(boolean) (
  sfunc=badd,
  stype=int8,
  initcond='0'
);

This aggregate should calculate the number of rows with TRUE. For example, the following should return 2 (and it does):

with t (x) as 
   (values 
      (true::boolean), 
      (false::boolean), 
      (true::boolean),
      (null::boolean)      
      )
select sum(x) from t;

However, its performance is quite bad: it is 5.5 times slower than casting to integer:

with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector) from t; -- 52012ms

with t as (select (gs > 0.5) as test_vector from generate_series(1,1000000,1) gs)
select sum(test_vector::int) from t; -- 9484ms

Is the only way to improve this aggregate to write a new C function - e.g. some alternative to the int2_sum function in src/backend/utils/adt/numeric.c?

Tomas Greif
  • That's not an int2-sum. It's a bool-sum, so more like bit-1. And it's wrong, too. It should be: `$1 + $2::int`, or `$1 + case $2 when true then 1 else 0 end`, or something to that order. – Denis de Bernardy Dec 01 '13 at 19:19
  • 1
    I still don't see the reason for a custom aggregate, both `select sum(x::int) from t;` and `select sum(1) from t where x;` are correct and much faster – foibs Dec 01 '13 at 19:35
  • @Denis I updated my question, my aggregate works correctly and does what I want to, but it is slow. – Tomas Greif Dec 01 '13 at 19:43

3 Answers


Your test case is misleading: you only count TRUE. You should have both TRUE and FALSE - or even NULL, if applicable.

Like @foibs already explained, one wouldn't use a custom aggregate function for this. The built-in C functions are much faster and do the job. Use this instead (also demonstrating a simpler and more sensible test):

SELECT count(NULLIF(g%2 = 1, FALSE)) AS ct
FROM  generate_series(1,100000,1) g;

How does this work?
Compute percents from SUM() in the same SELECT sql query
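
In short: NULLIF(expr, FALSE) yields NULL for FALSE (and stays NULL for NULL), and count() ignores NULL, so only TRUE rows are counted. A minimal illustration of the building blocks:

SELECT NULLIF(TRUE,  FALSE)         AS t  -- TRUE, counted
     , NULLIF(FALSE, FALSE)         AS f  -- NULL, ignored by count()
     , NULLIF(NULL::boolean, FALSE) AS n; -- NULL, ignored by count()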

Several fast & simple ways (plus a benchmark) under this related answer on dba.SE:
For absolute performance, is SUM faster or COUNT?

Or faster yet, test for TRUE in the WHERE clause, where possible:

SELECT count(*) AS ct
FROM   generate_series(1,100000,1) g
WHERE  g%2 = 1             -- excludes FALSE and NULL !
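
Applied to an actual boolean column, like x in the question's CTE, that becomes (a minimal sketch):

WITH t (x) AS (VALUES (TRUE), (FALSE), (TRUE), (NULL::boolean))
SELECT count(*) AS ct
FROM   t
WHERE  x;                  -- only TRUE qualifies, FALSE and NULL are excluded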

If you'd have to write a custom aggregate for some reason, this form would be superior:

CREATE OR REPLACE FUNCTION test_sum_int8 (int8, boolean)
  RETURNS bigint as
'SELECT CASE WHEN $2 THEN $1 + 1 ELSE $1 END' LANGUAGE sql;

The addition is only executed when necessary. Your original would add 0 for the FALSE case.
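
A matching aggregate definition could look like this (a sketch; the aggregate name test_sum_sql is only illustrative):

CREATE AGGREGATE test_sum_sql(boolean) (
  sfunc = test_sum_int8
 ,stype = int8
 ,initcond = '0'
);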

Better yet, use a plpgsql function. It saves a bit of overhead per call, since it works like a prepared statement (the query is not re-planned). Makes a difference for a tiny aggregate function that is called many times:

CREATE OR REPLACE FUNCTION test_sum_plpgsql (int8, boolean)
  RETURNS bigint AS
$func$
BEGIN
RETURN CASE WHEN $2 THEN $1 + 1 ELSE $1 END;
END
$func$ LANGUAGE plpgsql;

CREATE AGGREGATE test_sum_plpgsql(boolean) (
  sfunc = test_sum_plpgsql
 ,stype = int8
 ,initcond = '0'
);
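
To try it out, a quick test along the lines of the count() example above (a sketch):

SELECT test_sum_plpgsql(g%2 = 1) AS ct
FROM   generate_series(1,100000,1) g;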

Faster than what you had, but much slower than the presented alternative with a standard count(). And slower than any C function, too.

→ SQLfiddle

Erwin Brandstetter
  • Ooh, you are right, my test case should be: `with t as (select (random() > 0.5) as test_vector from generate_series(1,1000000,1) gs) select sum(test_vector) from t;` I'll try to write my own C function. – Tomas Greif Dec 02 '13 at 09:18

I created a custom C function and aggregate for boolean:

C function:

#include "postgres.h"
#include <fmgr.h>

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

/* old-style (version 0) calling convention; the state type is bigint (int64) */
int64
bool_sum(int64 arg, bool tmp)
{
    /* increment the running total when the boolean is true */
    if (tmp)
        arg++;

    return arg;
}

Transition and aggregate functions:

-- transition function
create or replace function bool_sum(bigint, boolean)
  returns bigint
   AS '/usr/lib/postgresql/9.1/lib/bool_agg', 'bool_sum'
  language C strict
  cost 1;
alter function bool_sum(bigint, boolean) owner to postgres;

-- aggregate
create aggregate sum(boolean) (
  sfunc=bool_sum,
  stype=int8,
  initcond='0'
);
alter aggregate sum(boolean) owner to postgres;

Performance test:

-- Performance test - 10m rows

create table tmp_test as (
  select (case when random() < .3 then null
               when random() < .6 then true
               else false
          end) as test_vector
  from generate_series(1,10000000,1) gs
);

-- Casting to integer
select sum(test_vector::int) from tmp_test;

-- Boolean sum
select sum(test_vector) from tmp_test;

Now sum(boolean) is as fast as sum(boolean::int).

Update:

It turns out that I can call existing C transition functions directly, even with the boolean data type. The boolean is somehow converted to 0/1 on the way. So my current solution for boolean sum and average is:

create or replace function bool_sum(bigint, boolean)
  returns bigint as
'int2_sum'
  language internal immutable
  cost 1;


create aggregate sum(boolean) (
  sfunc=bool_sum,
  stype=int8
);

-- Average for boolean values (percentage of rows with TRUE)
create or replace function bool_avg_accum(bigint[], boolean)
  returns bigint[] as
'int2_avg_accum'
  language internal immutable strict
  cost 1;

create aggregate avg(boolean) (
  sfunc=bool_avg_accum,
  stype=int8[],
  finalfunc=int8_avg,
  initcond='{0,0}'
);
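
For example, run against the tmp_test table from the performance test above (a usage sketch; since the avg transition function above is declared strict, NULL booleans are skipped, so the result is the share of TRUE among non-NULL rows):

select sum(test_vector) as true_count,
       avg(test_vector) as true_ratio
from tmp_test;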
Tomas Greif
  • +1 For providing your interesting solution. Even if it wouldn't be necessary for the presented case, it may be useful for related problems. – Erwin Brandstetter Dec 02 '13 at 15:06
  • 1
    @ErwinBrandstetter I've updated my answer - there is no need for custom C function and still it will be as fast as built in aggregates. – Tomas Greif Dec 06 '13 at 07:59

I don't see the real issue here. First of all, using sum as your custom aggregate name is wrong: when you call sum with your test_vector cast to int, the built-in Postgres sum is used instead of yours, which is why it is so much faster. A C function will always be faster, but I'm not sure you need one in this case.

You could easily drop the badd function and your custom sum, and use the built-in sum with a WHERE clause:

with t as (select 1 as test_vector from generate_series(1,1000000,1) gs where gs > 0.5)
select sum(test_vector) from t;

EDIT:

To sum it up, the best way to optimize your custom aggregate is to remove it if it is not needed. The second best way would be to write a C function.

foibs