Count distinct over multiple columns in USQL

Question

I'd like to count distinct over two columns.

In imaginary sql, I'd write the below:

COUNT(DISTINCT col1, col2)

In T-SQL, this is done by having one grouping in a sub query and then a second grouping in the outer query - like this

As far as I know, USQL doesn't have sub queries, so I'm a bit stuck trying to work out how to do this. I suppose I could use two aliased queries, but I was wondering if there was a better way.

score 3 · Accepted Answer · answered May 31 '18 at 18:14

U-SQL does support subqueries. What it does not (yet) support are scalar subqueries and correlated subqueries (use a join for both instead).

So the solution

SELECT COUNT(*) AS count
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
      FROM DocumentOutputItems) AS internalQuery

or the mentioned SUM(CASE) with GROUP BY in the link above both work.

Alternatively, U-SQL also allows you to name the inner query and use it in the outer statement which makes the script easier to read for more complex cases (it is similar to a common-table expression in that it does only name the query and does not execute it):

@inner = 
  SELECT DISTINCT DocumentId, DocumentSessionId
  FROM DocumentOutputItems;
@result = SELECT COUNT(*) AS count FROM @inner;

Thanks, I hadn't seen any examples of sub queries in the documentation, so assumed they weren't possible. As you say, the aliased queries are normally much more readable, but I've got a process that's generating usql, to emulate CUBE functionality, so subqueires fit my process better — Neil P, Jun 01 '18 at 08:30

score 1 · Answer 2 · answered May 31 '18 at 10:37

1

If you cannot use subquery then one "ugly" way is to concatenate columns:

SELECT COUNT(DISTINCT String.Concat(col1, '^', col2))
FROM tab;

Of course you have to handle NULLs and convert to proper data type.

answered May 31 '18 at 10:37

Lukasz Szozda

162,964
23
234
275

Count distinct over multiple columns in USQL

2 Answers2