
I have a table in the following format

    ID  Property  Value
    1   name      Tim
    1   location  USA
    1   age       30
    2   name      Jack
    2   location  UK
    2   age       27

And I would like an output in the following format

    ID  name  location  age
    1   Tim   USA       30
    2   Jack  UK        27

In Python I can do:

    table_agg = table.groupby('ID')[['Property','Value']].apply(lambda x: dict(x.values))
    p = pd.DataFrame(list(table_agg))

How to write the query in Hive?
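For reference, the same reshape can be done in pandas without the groupby/apply round-trip by using the built-in `pivot`; a minimal, self-contained sketch (the DataFrame construction is illustrative, mirroring the sample table above):

```python
import pandas as pd

# Long-format input, as in the question
table = pd.DataFrame({
    'ID':       [1, 1, 1, 2, 2, 2],
    'Property': ['name', 'location', 'age', 'name', 'location', 'age'],
    'Value':    ['Tim', 'USA', '30', 'Jack', 'UK', '27'],
})

# pivot turns each distinct Property value into its own column
wide = table.pivot(index='ID', columns='Property', values='Value').reset_index()
wide.columns.name = None                         # drop the leftover 'Property' axis label
wide = wide[['ID', 'name', 'location', 'age']]   # restore the desired column order
print(wide)
```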

Barbaros Özhan
lgao

1 Answer


You can use the collect_list and map functions to group the data, then access the resulting array of maps by key.

Example:

hive> create table t1(id int, property string, valu string) stored as orc;
hive> insert into t1 values (1,"name","Tim"),(1,"location","USA"),(1,"age","30"),
                            (2,"name","Jack"),(2,"location","UK"),(2,"age","27");

hive> select id,
             va[0]["name"]     as name,
             va[1]["location"] as location,
             va[2]["age"]      as age
        from (
             select id, collect_list(map(property, valu)) as va
               from t1
              group by id
            ) t;

Result:

id      name    location        age
1       Tim     USA             30
2       Jack    UK              27
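As the comments below note, some Hive builds reject this with `UDFArgumentTypeException Only primitive type arguments are accepted`, because their collect_list does not accept map arguments. A portable workaround is a conditional-aggregation pivot, which only ever aggregates primitive values. A sketch of that pattern, demonstrated here against SQLite via Python since the SQL itself is generic (in Hive, run the same select against your own table):

```python
import sqlite3

# In-memory table standing in for the Hive table; only the DDL syntax differs.
con = sqlite3.connect(':memory:')
con.execute("create table t1(id int, property text, value text)")
con.executemany("insert into t1 values (?,?,?)", [
    (1, 'name', 'Tim'),  (1, 'location', 'USA'), (1, 'age', '30'),
    (2, 'name', 'Jack'), (2, 'location', 'UK'),  (2, 'age', '27'),
])

# Conditional aggregation: one max(case ...) per target column.
# The case expression yields NULL for non-matching rows, and max()
# ignores NULLs, so each column keeps the single matching value.
rows = con.execute("""
    select id,
           max(case when property = 'name'     then value end) as name,
           max(case when property = 'location' then value end) as location,
           max(case when property = 'age'      then value end) as age
      from t1
     group by id
     order by id
""").fetchall()
print(rows)   # [(1, 'Tim', 'USA', '30'), (2, 'Jack', 'UK', '27')]
```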
notNull
  • Thanks @Shu! I had an error trying this though. The error message is: UDFArgumentTypeException Only primitive type arguments are accepted but map was passed as parameter 1. I guess that is because map is not considered a primitive type in Hive? Any thoughts on how to get around this? – lgao Jul 12 '19 at 20:28
  • @lgao weird!, I added `my create table statements` and able to run the query without any issues in **Hive 1.1 and 2.1**. if you are having `different datatypes` then add your hive create table statement to the question and error that you are getting! – notNull Jul 12 '19 at 21:06
  • I am also facing the same issue: FAILED: UDFArgumentTypeException Only primitive type arguments are accepted but map was passed as parameter 1 – vikrant rana Aug 20 '19 at 17:40
  • @dhnsh. I was looking at below problem to solve it in hive but facing an issue while collecting map values like you did. https://stackoverflow.com/questions/57541507/effective-way-to-groupby-without-using-pivot-in-pyspark?noredirect=1#comment101547274_57541507 – vikrant rana Aug 20 '19 at 17:54