
I've got the following table:

hive> describe tv_counter_stats;
OK
day       string
event     string
query_id  string
userid    string
headers   string

And I want to perform the following query:

hive -e 'SELECT 
    day,
    event,
    query_id,
    COUNT(1) AS count,
    COLLECT_SET(userid)
FROM
    tv_counter_stats
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv

However, this query fails. But the following query works fine:

hive -e 'SELECT 
    day,
    event,
    query_id,
    COUNT(1) AS count
FROM
    tv_counter_stats
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv

where I have removed the COLLECT_SET call. So my question: does anybody have an idea why COLLECT_SET might fail in this case?

UPDATE: Error message added:

Diagnostic Messages for this Task:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 10.49 sec   HDFS Read: 109136387 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 10 seconds 490 msec

java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)

Error: GC overhead limit exceeded
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)

Error: GC overhead limit exceeded

UPDATE 2: I altered the query so that it now looks like this:

hive -e '
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC";
SELECT 
    day,
    event,
    query_id,
    COUNT(1) AS count,
    COLLECT_SET(userid)
FROM
    tv_counter_stats
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv

However, then I get the following error:

Diagnostic Messages for this Task:
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 3  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
toom
2 Answers


This is probably a memory problem, since COLLECT_SET aggregates data in memory.

Try increasing the heap size and enabling concurrent GC, e.g. by setting Hadoop's mapred.child.java.opts to -Xmx1g -XX:+UseConcMarkSweepGC.
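
For example, this can be set per session before the query, or passed on the command line (a sketch; the heap value must fit within whatever your cluster allows per task slot):

SET mapred.child.java.opts=-Xmx1g -XX:+UseConcMarkSweepGC;

or

hive --hiveconf mapred.child.java.opts='-Xmx1g -XX:+UseConcMarkSweepGC' -e '...'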

This answer has more information about the "GC overhead limit exceeded" error.

Nigel Tufnel

I had the same exact problem and came across this question, so I thought I'd share the solution I found.

The underlying problem is most likely that Hive is trying to do the aggregation on the mapper side, and the heuristics it uses to manage the in-memory hash maps for that approach are thrown off by data that is "wide but shallow" -- i.e. in your case, if there are very few userid values per day/event/query_id group.
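
For reference, map-side aggregation is governed by a handful of Hive settings; the values below are typical defaults for Hive of that era but vary by version, so treat this as a sketch of the relevant knobs rather than exact numbers:

SET hive.map.aggr = true;                        -- enable map-side (hash-based) aggregation
SET hive.groupby.mapaggr.checkinterval = 100000; -- rows to process before checking the hash map's effectiveness
SET hive.map.aggr.hash.min.reduction = 0.5;      -- minimum reduction ratio required to keep aggregating map-side
SET hive.map.aggr.hash.percentmemory = 0.5;      -- fraction of the mapper heap the hash map may occupy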

I found an article that explains various ways to address this issue, but most of them are just refinements of the all-out nuclear option: disabling mapper-side aggregation entirely.

Using set hive.map.aggr = false; should do the trick.
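
Applied to the query from the question, this looks like the following (a sketch that simply prepends the setting, keeping the output redirection as in the original):

hive -e '
SET hive.map.aggr = false;
SELECT 
    day,
    event,
    query_id,
    COUNT(1) AS count,
    COLLECT_SET(userid)
FROM
    tv_counter_stats
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv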

Bill Clark