0

Context

I'm building a Postgres extension, which adds possibility to ask fuzzy queries with linguistic variables. For example:

SELECT age~=('age'|>'adult') FROM people;

will return the plausibility of the fact that a person is an adult (adult is defined as a trapezoidal function 30/40~60\65.

Problem

I made a function that allows to return the linguistic name for specified value and linguistic variable:

SELECT age, age~>'age' FROM people;

returns

 age | age~>'age'
-----+--------------
 20  | young adult
 10  | child
 45  | adult
 60  | old

Source of this function and operator looks like this:

CREATE FUNCTION get_fuzzy_name(
  input FLOAT8,
  type_name VARCHAR(64)
) RETURNS VARCHAR(64) AS $$
DECLARE
  type_id fuzzy.types.id%TYPE;
  deg FLOAT8;
  result_name VARCHAR(64);
BEGIN
  type_id := get_fuzzy_type(type_name); -- returns type's id based on it's name
  SELECT
      degree(input, fun) AS d, name
    INTO
      deg, result_name
    FROM fuzzy.functions
    WHERE type=type_id
    ORDER BY d DESC LIMIT 1;
  IF deg=0 THEN
    RETURN NULL;
  END IF;
  RETURN result_name;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OPERATOR ~> (
  PROCEDURE = get_fuzzy_name,
  LEFTARG = FLOAT8,
  RIGHTARG = VARCHAR(64)
);

The problem is, that for every row, the function above queries for fuzzy type and functions again and again. So I came up with this, as a base for improvements (fuzzy functions are held in a variable):

CREATE TYPE FUZZY_TYPE_FUNCTION AS (
  func  TRAPEZOIDAL_FUNCTION,
  range_name VARCHAR(64)
);
CREATE FUNCTION get_fuzzy_name(
  input FLOAT8,
  type_name VARCHAR(64)
) RETURNS VARCHAR(64) AS $$
DECLARE
  f FUZZY_TYPE_FUNCTION;
  _type_functions FUZZY_TYPE_FUNCTION[] := array(SELECT (fun, name) FROM fuzzy.functions WHERE fuzzy.functions.type=type_name);
  _deg_tmp FLOAT8;
  _deg FLOAT8;
  _result_name VARCHAR(64);
BEGIN
  _deg := 0;
  FOREACH f IN array(_type_functions) LOOP
    _deg_tmp := degree(input, f.func);
    RAISE NOTICE '% && % = %', f, input, _deg_tmp;
    IF _deg<_deg_tmp THEN
      _deg := _deg_tmp;
      _result_name := f.range_name;
      EXIT WHEN _deg=1;
    END IF;
  END LOOP;
  IF _deg=0 THEN
    RETURN NULL;
  END IF;
  RETURN _result_name;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

Is there a way to get the value of functions table just once per query and cache it, so it could be reused and drastically speed up the whole query?

Additional information

As requested, this are the tables:

CREATE SCHEMA IF NOT EXISTS fuzzy;

CREATE TABLE IF NOT EXISTS fuzzy.types (
  id   SERIAL PRIMARY KEY,
  name VARCHAR(64) UNIQUE
);

CREATE TABLE IF NOT EXISTS fuzzy.functions (
  type INT                  NOT NULL,
  fun  TRAPEZOIDAL_FUNCTION NOT NULL,
  name VARCHAR(64),
  FOREIGN KEY (type) REFERENCES fuzzy.types (id) ON DELETE CASCADE,
  UNIQUE (type, name)
);

fuzzy.types will probably contain a few rows which are id-name pairs, fuzzy.functions will most likely contain 3-10 rows per each type, which for heavy use cases may be about 500 rows (I guess).

Chris Miemiec
  • 755
  • 1
  • 9
  • 19
  • What does the `fuzzy.functions` table look like? Does it hold a few rows only, and if so, can you show or describe what those rows hold? What about `type_id := get_fuzzy_type(type_name)`, how many different results can this return? – 404 Nov 02 '18 at 10:14
  • 1
    As the function is declared `immutable` it will only be called once for each pair of input values. The problem is: it shouldn't be declared as `immutable` as it accesses tables in the database. You can only declare it as `stable` and in that case the results for the same input should also be cached. –  Nov 02 '18 at 12:39
  • Thank you! @eurotrash I've added additional information. I also made a benchmark showing that the first query completes in 24s for ~1M rows and the second one is 2 times slower. I wouldn't mind rewriting the whole mechanism to something else, but for now I wonder if there is possibility to cache just one variable in a function. I know that aggregates can store state, but it's probably not applicable here. – Chris Miemiec Nov 02 '18 at 14:31

1 Answers1

1

You may be building on a couple of misleading assumptions about function performance.

Try this simplified SQL function instead:

CREATE FUNCTION get_fuzzy_name(_input FLOAT8, _type_name VARCHAR(64))
  RETURNS VARCHAR(64) AS
$func$
   SELECT f.name
   FROM   fuzzy.functions f
   JOIN   fuzzy.types     t ON t.id = f.type
   WHERE  t.name = _type_name
   AND    degree(_input, f.fun) > 0
   ORDER  BY degree(_input, f.fun) DESC
   LIMIT  1;
$func$  LANGUAGE sql STABLE;
  • LANGUAGE sql. No variables, assignments, IF constructs, ... 1 simple query. Complete rewrite, but should be equivalent.

  • STABLE, not IMMUTABLE.

  • No nested function call at all. Replaced with a join. Should be cheaper.

  • It might be cheaper, yet, to inline the undisclosed degree() function as well. May even burn down to a much faster "nearest neighbor" query. Not enough info.

  • This function can be inlined as opposed to your original. I removed STRICT, which might be in the way. Can't tell, not enough info.

See the Postgres Wiki about inlining of scalar functions.
And:

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks! The change to plain SQL function type has improved performance from ~25s to ~20s for the same data set. I'd like to improve it further by rewriting the whole mechanism to C later - I'll post results here when it's ready. – Chris Miemiec Nov 03 '18 at 19:29