SQL CLR aggregate not terminating correctly when applied over huge amount of data

Question

I have create and used a lot of times a SQL CLR aggregate which is concatenating values - it also order the values by specified number and use user input separator for concatenating the them.

I have used the same aggregate over large amount of data and notice that the separator is not used - the values are concatenated but without the separator.

After a lot of tests, I found that in the Terminate method, the delimiter is missing/not read. I have double check this using hard-coded separator - everything worked fine.

I guess that there is something wrong with my Read and Write method (used when large amount of data is handled) but not able to understand what.

Here is the function code:

[Serializable]
[
    Microsoft.SqlServer.Server.SqlUserDefinedAggregate
    (
        Microsoft.SqlServer.Server.Format.UserDefined,
        IsInvariantToNulls = true,
        IsInvariantToDuplicates = false,
        IsInvariantToOrder = false,
        IsNullIfEmpty = false,
        MaxByteSize = -1
    )
]
/// <summary>
/// Concatenates <int, string, string> values defining order using the specified number and using the given delimiter
/// </summary>
public class ConcatenateWithOrderAndDelimiter : Microsoft.SqlServer.Server.IBinarySerialize
{
    private List<Tuple<int, string>> intermediateResult;
    private string delimiter;
    private bool isDelimiterNotDefined;

    public void Init()
    {
        this.delimiter = ",";
        this.isDelimiterNotDefined = true;
        this.intermediateResult = new List<Tuple<int, string>>();
    }

    public void Accumulate(SqlInt32 position, SqlString text, SqlString delimiter)
    {
        if (this.isDelimiterNotDefined)
        {
            this.delimiter = delimiter.IsNull ? "," : delimiter.Value;
            this.isDelimiterNotDefined = false;
        }

        if (!(position.IsNull || text.IsNull))
        {
            this.intermediateResult.Add(new Tuple<int, string>(position.Value, text.Value));
        }
    }

    public void Merge(ConcatenateWithOrderAndDelimiter other)
    {
        this.intermediateResult.AddRange(other.intermediateResult);
    }

    public SqlString Terminate()
    {
        this.intermediateResult.Sort();
        return new SqlString(String.Join(this.delimiter, this.intermediateResult.Select(tuple => tuple.Item2)));
    }

    public void Read(BinaryReader r)
    {
        if (r == null) throw new ArgumentNullException("r");

        int count = r.ReadInt32();
        this.intermediateResult = new List<Tuple<int, string>>(count);

        for (int i = 0; i < count; i++)
        {
            this.intermediateResult.Add(new Tuple<int, string>(r.ReadInt32(), r.ReadString()));
        }

        this.delimiter = r.ReadString();
    }

    public void Write(BinaryWriter w)
    {
        if (w == null) throw new ArgumentNullException("w");

        w.Write(this.intermediateResult.Count);

        foreach (Tuple<int, string> record in this.intermediateResult)
        {
            w.Write(record.Item1);
            w.Write(record.Item2);
        }

        w.Write(this.delimiter);
    }
}

@mjwills It is called automatically. This methods are handle but the engine - https://learn.microsoft.com/en-us/sql/relational-databases/clr-integration-database-objects-user-defined-functions/clr-user-defined-aggregates-requirements?view=sql-server-2017. — gotqn, Aug 22 '18 at 09:50
@mjwills Yes, I know, but it's very difficult to set up. First, you need to have `SQL Server`, then you need to activate and create the SQL CLR aggregate using the code above, and then, to send you somehow huge amount of data to test, because it is working fine with small amounts. Basically, I am hoping someone who is familiar with sql-clr to give some ideas of what can be going wrong. — gotqn, Aug 22 '18 at 09:58
If somebody find it useful: Starting from MSSQL 2017 there is a built in string concatenate aggregate called: [String_Agg](https://learn.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql?view=sql-server-2017) — Tamás Kecskeméti, Sep 19 '18 at 20:06

score 1 · Answer 1 · answered Aug 22 '18 at 10:05

I have found the issue. It was in the Merge method. It was:

public void Merge(ConcatenateWithOrderAndDelimiter other)
{
    this.intermediateResult.AddRange(other.intermediateResult);
}

and I change it to:

public void Merge(ConcatenateWithOrderAndDelimiter other)
{
    this.intermediateResult.AddRange(other.intermediateResult);
    this.delimiter = other.delimiter;
}

It seems that when data is merge, the delimiter is not initialized. I suppose in the above context, all this attributes are null.

Anyway, I am not going to accept this as answer, because it will be helpful if anyone is able to explain what is going on internally.

score 1 · Accepted Answer · answered Sep 19 '18 at 19:48

The Merge() method is invoked only when parallelism is used and a particular group is spread across more than 1 thread. In this case, Init() has been called, and 0 or more instances of Accumulate().

So, in the case of parallelism, if Init() has been called but no Accumulate() has been called yet, then the value in delimiter would be what was set in the Init() method. The code in the question shows that it is being set to ,, but I suspect that was added later in trying to figure this out. Of course, this assumes that a comma is being passed in as the delimiter into Accumulate(). Or perhaps the comma was always being set as the default in Init(), but another character was passed in via Accumulate() and that was not coming through the final output (the specific call to the UDA is not shown in the question, nor is the incorrect output, so there is some ambiguity here).

While the fix shown in the other answer appears to work, it is not a universal fix given that there could be a case where the current object has had Accumulate() called at least once, but the "other" object being merged into this one is still empty (no matching rows perhaps, or some other reason that values were not stored locally when Accumulate() was called). In that case, the current object would have the desired delimiter but the "other" object would still have the default. The ideal solution would be to also store the value of isDelimiterNotDefined in the Write() method, get it back out again in the Read() method, and compare the local value to other.isDelimiterNotDefined in the Merge() method so that you can determine if you should keep the local or other value of delimiter (depending on which one is set / defined).

So, because of the huge amount of data parallelism is used. Wouldn't be best to read/write the delimiter in the `read()/write()` methods, too? — gotqn, Sep 21 '18 at 05:06
@gotqn You already do read/write the delimiter in the `read()` and `write()` methods ;-). `this.delimiter = r.ReadString();` and `w.Write(this.delimiter);`. That is why it works correctly when there is no parallelism. Data is written via `Write()` when the last row is collected during aggregation, or when that thread finishes its group of rows. And then to calculate and display the aggregated result, `Read()` is called and then `Terminate()`. — Solomon Rutzky, Sep 21 '18 at 05:50
@gotqn You should be able to test this by undoing the hard-coded delimiter so that you see the problem with the large data set, then add `OPTION (MAXDOP 1)` to the query to disable parallelism and it _should_ work again. Please try that and let me know if it behaves the way that I am describing. Thanks. — Solomon Rutzky, Sep 21 '18 at 05:51
Thanks for the help and the ideas. You have always been great contributor for all of my SQL CLR questions. — gotqn, Oct 19 '18 at 05:10
@gotqn You are quite welcome. Questions are an opportunity for me to learn, so this has been good for both of us :-). Just curious: did you try my suggested test of putting the code back to reproduce the problem, and then add `OPTION (MAXDOP 1)` to the query to see if that fixes it? — Solomon Rutzky, Oct 25 '18 at 14:37

SQL CLR aggregate not terminating correctly when applied over huge amount of data

2 Answers2