Building on Jonathan Leffler's example and on RET's comments about the ordering of the concatenated values, and using Informix 12.10FC8DE, I came up with the following user-defined aggregate:
CREATE FUNCTION mgc_init
(
    dummy VARCHAR(255)
)
RETURNING
    SET(LVARCHAR(2048) NOT NULL);

    -- INIT: start the aggregation with an empty SET
    RETURN SET{}::SET(LVARCHAR(2048) NOT NULL);
END FUNCTION;
CREATE FUNCTION mgc_iter
(
    p_result SET(LVARCHAR(2048) NOT NULL)
    , p_value VARCHAR(255)
)
RETURNING
    SET(LVARCHAR(2048) NOT NULL);

    -- ITER: add each non-NULL, trimmed value to the SET (the SET keeps only distinct values)
    IF p_value IS NOT NULL THEN
        INSERT INTO TABLE(p_result) VALUES (TRIM(p_value));
    END IF;
    RETURN p_result;
END FUNCTION;
CREATE FUNCTION mgc_comb
(
    p_partial1 SET(LVARCHAR(2048) NOT NULL)
    , p_partial2 SET(LVARCHAR(2048) NOT NULL)
)
RETURNING
    SET(LVARCHAR(2048) NOT NULL);

    -- COMBINE: merge the second partial SET into the first
    INSERT INTO TABLE(p_partial1)
        SELECT vc1 FROM TABLE(p_partial2)(vc1);
    RETURN p_partial1;
END FUNCTION;
CREATE FUNCTION mgc_fini
(
    p_final SET(LVARCHAR(2048) NOT NULL)
)
RETURNING
    LVARCHAR;

    DEFINE l_str LVARCHAR(2048);
    DEFINE l_value LVARCHAR(2048);

    -- FINAL: walk the SET in sorted order and build the comma-separated string
    LET l_str = NULL;
    FOREACH SELECT vvalue1 INTO l_value FROM TABLE(p_final) AS vt1(vvalue1) ORDER BY vvalue1
        IF l_str IS NULL THEN
            LET l_str = l_value;
        ELSE
            LET l_str = l_str || ',' || l_value;
        END IF;
    END FOREACH;
    RETURN l_str;
END FUNCTION;
GRANT EXECUTE ON mgc_fini TO PUBLIC;
CREATE AGGREGATE m_group_concat
WITH
(
    INIT = mgc_init
    , ITER = mgc_iter
    , COMBINE = mgc_comb
    , FINAL = mgc_fini
);
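Once registered, the aggregate is called like any built-in aggregate. A minimal, untested sketch, assuming a hypothetical table t with columns grp INTEGER and val VARCHAR(255):
SELECT grp, m_group_concat(val) AS vals
FROM t
GROUP BY grp;
Each vals result should then contain the distinct, trimmed values of val for the group, sorted and separated by commas.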
The concatenated values will have no duplicates and will be ordered.
To keep the code reasonably simple, I used an Informix collection type, namely SET, which does not allow duplicate values. The method is to use SETs to hold the intermediate results (eliminating duplicates along the way) and, at the end, to build the concatenated string from the ordered values of the final SET.
The use of LVARCHAR for the SET elements is due to the fact that I initially used VARCHAR, but the memory consumption was very, very high. The documentation hints that internally Informix may cast the VARCHAR to CHAR. I made the change and it did in fact lower the memory consumption (but it is still high).
However, this aggregate's memory consumption is around two orders of magnitude higher than Jonathan's, and it is about two times slower in the tests I conducted (using a table with around 300,000 rows).
So use it with care: it consumes a lot of memory and it is not extensively tested (it may be leaking memory somewhere).
EDIT 1:
My previous code must be leaking a memory structure somewhere (or, internally, Informix keeps the collection-derived tables around, and it can generate a lot of those).
So, still trying to avoid coding the aggregate function in C, here is another alternative that uses the built-in Informix BSON functions; it uses much less memory and is a bit faster.
CREATE FUNCTION m2gc_init
(
    dummy VARCHAR(255)
)
RETURNING
    BSON;

    -- INIT: start with a BSON document holding an empty "terms" array
    RETURN '{"terms":[]}'::JSON::BSON;
END FUNCTION;
CREATE FUNCTION m2gc_iter
(
    p_result BSON
    , p_value VARCHAR(255)
)
RETURNING
    BSON;

    DEFINE l_add_array_element LVARCHAR(2048);

    -- ITER: $addToSet appends the value only if it is not already in the array
    IF p_value IS NOT NULL THEN
        LET l_add_array_element = '{ $addToSet: { terms: "' || TRIM(p_value) || '" } }';
        LET p_result = BSON_UPDATE(p_result, l_add_array_element);
    END IF;
    RETURN p_result;
END FUNCTION;
CREATE FUNCTION m2gc_comb
(
    p_partial1 BSON
    , p_partial2 BSON
)
RETURNING
    BSON;

    DEFINE l_array_elements LVARCHAR(2048);
    DEFINE l_an_element LVARCHAR(2048);
    DEFINE l_guard INTEGER;

    -- COMBINE: drain the second partial array and $addToSet its elements into the first
    LET l_array_elements = NULL;
    LET l_guard = BSON_SIZE(p_partial2, 'terms.0');
    IF l_guard > 0 THEN
        WHILE l_guard > 0
            LET l_an_element = BSON_VALUE_LVARCHAR(p_partial2, 'terms.0');
            IF l_array_elements IS NULL THEN
                LET l_array_elements = '"' || l_an_element || '"';
            ELSE
                LET l_array_elements = l_array_elements || ', "' || l_an_element || '"';
            END IF;
            LET p_partial2 = BSON_UPDATE(p_partial2, '{ $pop: { terms: -1 } }');
            LET l_guard = BSON_SIZE(p_partial2, 'terms.0');
        END WHILE;
        LET l_array_elements = '{ $addToSet: { terms: { $each: [ ' || l_array_elements || ' ] } } }';
        LET p_partial1 = BSON_UPDATE(p_partial1, l_array_elements);
    END IF;
    RETURN p_partial1;
END FUNCTION;
CREATE FUNCTION m2gc_fini
(
    p_final BSON
)
RETURNING
    LVARCHAR;

    DEFINE l_str_agg LVARCHAR(2048);
    DEFINE l_an_element LVARCHAR(2048);
    DEFINE l_iter_int INTEGER;
    DEFINE l_guard INTEGER;

    -- FINAL: sort the array in place, then walk it and build the comma-separated string
    LET l_str_agg = NULL;
    LET l_guard = BSON_SIZE(p_final, 'terms.0');
    IF l_guard > 0 THEN
        LET p_final = BSON_UPDATE(p_final, '{ $push: { terms: { $each: [], $sort: 1 } } }');
        LET l_iter_int = 0;
        WHILE l_guard > 0
            LET l_an_element = BSON_VALUE_LVARCHAR(p_final, 'terms.' || l_iter_int);
            IF l_str_agg IS NULL THEN
                LET l_str_agg = TRIM(l_an_element);
            ELSE
                LET l_str_agg = l_str_agg || ',' || TRIM(l_an_element);
            END IF;
            LET l_iter_int = l_iter_int + 1;
            LET l_guard = BSON_SIZE(p_final, 'terms.' || l_iter_int);
        END WHILE;
    END IF;
    RETURN l_str_agg;
END FUNCTION;
CREATE AGGREGATE m2_group_concat
WITH
(
    INIT = m2gc_init
    , ITER = m2gc_iter
    , COMBINE = m2gc_comb
    , FINAL = m2gc_fini
);
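m2_group_concat is invoked the same way as the first aggregate; reusing the hypothetical table t from the earlier sketch:
SELECT grp, m2_group_concat(val) AS vals
FROM t
GROUP BY grp;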
The aggregated return value will be ordered and free of duplicates.
Again, this was not properly tested; it is just a POC.
One of the problems is that it does not sanitize the input values. Some of the BSON manipulation functions receive parameters that are built by concatenating strings, and unescaped characters can break those parameters. For example, a string value containing a double quote (such as 'I"BrokeIt') can provoke an assortment of errors (assert failures included).
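A partial mitigation, shown here only as an untested sketch, would be to escape backslashes and double quotes before splicing the value into the $addToSet document in m2gc_iter (using the built-in REPLACE string function):
LET l_add_array_element = '{ $addToSet: { terms: "'
    || REPLACE(REPLACE(TRIM(p_value), '\', '\\'), '"', '\"')
    || '" } }';
This covers only the most obvious JSON metacharacters; control characters and other edge cases would still need proper handling.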
And I am certain there are other problems.
However, the memory consumption of this implementation is of the same order of magnitude as Jonathan's example, and it is around 60% slower (again, only very rudimentary testing was performed).