Group by a generated column

Question

I'm trying to group data by minutes, so I tried this query:

SELECT FROM_UNIXTIME(
     unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts,
     count (*) as cnt 
     from toucher group by ts limit 10;

Then hive tells me no such column,

FAILED: SemanticException [Error 10004]: Line 1:134 Invalid table alias or column reference 'ts': (possible column names are: time, ip, username, code)

So is it not supported by hive?

The column TS is not in scope here ! you need to write this in outer query. — Ashutosh Arya, Sep 17 '13 at 06:05

Andrey Khmelev · Accepted Answer · 2013-09-17T08:09:56.457

6

SELECT FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts,
     count (*) as cnt 
from toucher 
group by FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') limit 10;

or and better

 select t.ts, count(*) from
(SELECT FROM_UNIXTIME(unix_timestamp (time, 'yyyy-mm-dd hh:mm:ss'), 'yyyy-mm-dd hh:mm') as ts             
    from toucher ) t
    group by t.ts limit 10;

edited Sep 17 '13 at 08:09

answered Sep 17 '13 at 06:08

Andrey Khmelev

1,141
8
13

The second one is not working ;-( `SemanticException [Error 10025]: Line 1:37 Expression not in GROUP BY key ''yyyy-mm-dd hh:mm:ss''` So the column seems invalid – daisy Sep 17 '13 at 07:00
Would you mind explaining why you think one approach is better than the other? – Lukas Vermeer Sep 17 '13 at 07:55
Subquery of the second query contains a `COUNT(*)` but no `GROUP BY`. Are you sure that's correct? What are you counting? – Lukas Vermeer Sep 17 '13 at 07:57
corrected secondary query, thnx. @LukasVermeer the second query is better because functions FROM_UNIXTIME and unix_timestamp are executing once – Andrey Khmelev Sep 17 '13 at 08:13
Indeed `FROM_UNIXTIME` is only executed once in the second approach, but you also have two nested queries and some intermediate data of unknown volume. What makes you think the `FROM_UNIXTIME` call has a bigger performance hit than the subquery? – Lukas Vermeer Sep 17 '13 at 08:51
@LukasVermeer function FROM_UNIXTIME in clause WHERE is bad way in optimization of query. If you have many rows in table (~100000) it will be critical in our PLAIN – Andrey Khmelev Sep 17 '13 at 09:06
@LukasVermeer look this article (on russian) http://tarlyun.com/blog/2011/03/18/from_unixtime-v-uslovii-zlo/ it's talking about FROM_UNIXTIME in WHERE – Andrey Khmelev Sep 17 '13 at 09:08
1

@mr.dump I don't speak Russian, but that article seems to concern MySQL, not Hive? It also seems to suggest that using `FROM_UNIXTIME` in the `WHERE` clause is bad for performance because it forces (in MySQL) to do a full table scan? Hive is very different from MySQL, I'm not sure this particular performance argument is relevant here. – Lukas Vermeer Sep 19 '13 at 09:07
1

@LukasVermeer I know many DB and all there are working about the same. Thats say Tom Kyte too. But it is always possible to test. – Andrey Khmelev Sep 20 '13 at 19:47
1

@mr.dump My point was that Hive is not really a database; it does not work the same as MySQL. Your argument about why one approach is better than the other does not apply here. – Lukas Vermeer Oct 15 '13 at 10:49

score 3 · Answer 2 · edited May 23 '17 at 12:25

As is the case with most relational database systems, the SELECT clause is processed after the GROUP BY clause. This means you cannot use columns aliased in the SELECT (such as ts in this example) in your GROUP BY.

There are essentially two ways around this. Both are correct, but some people have preference for one over the other for various reasons.

First, you could group by the original expression, rather than the alias. This results in duplicate code, as you will have the exact same expression in both your SELECT and GROUP BY clause.

SELECT 
    FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm') as ts,
    COUNT(*) as cnt 
FROM toucher 
GROUP BY FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm')
LIMIT 10;

A second approach is to wrap your expression and alias in a subquery. This means you do not have to duplicate your expression, but you will have two nested queries and this may have performance implications.

SELECT 
    ts,
    COUNT(*) as cnt 
FROM 
    (SELECT
        FROM_UNIXTIME(unix_timestamp(time,'yyyy-mm-dd hh:mm:ss'),'yyyy-mm-dd hh:mm') as ts,
     FROM toucher) x
GROUP BY x.ts
LIMIT 10;

Both should have the same result. Which you should use in this case will depend on your particular use; or perhaps personal preference.

Hope that helps.

Group by a generated column

2 Answers2