
I know there's no direct way to transpose data in Hive. I followed this question: Is there a way to transpose data in Hive?, but as there is no final answer there, I could not get all the way.

This is the table I have:

 | ID   |   Code   |  Proc1   |   Proc2 | 
 | 1    |    A     |   p      |   e     | 
 | 2    |    B     |   q      |   f     |
 | 3    |    B     |   p      |   f     |
 | 3    |    B     |   q      |   h     |
 | 3    |    B     |   r      |   j     |
 | 3    |    C     |   t      |   k     |

Here Proc1 can have any number of values. ID, Code, and Proc1 together form a unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 is the value in that column for the corresponding row. In essence, I'm trying to get something like:

 | ID   |   Code   |  p   |   q |  r  |   t |
 | 1    |    A     |   e  |     |     |     |
 | 2    |    B     |      |   f |     |     |
 | 3    |    B     |   f  |   h |  j  |     |
 | 3    |    C     |      |     |     |  k  |

In the new transformed table, ID and Code together form the primary key. From the question I linked above, I could get this far using the to_map UDAF. (Disclaimer: this may not be a step in the right direction, but I'm mentioning it here in case it is.)

 | ID   |   Code   |  Map_Aggregation   | 
 | 1    |    A     |   {p:e}            |
 | 2    |    B     |   {q:f}            |
 | 3    |    B     |   {p:f, q:h, r:j } |  
 | 3    |    C     |   {t:k}            |
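For reference, this is roughly the aggregation I ran to get there (a sketch only; it assumes the to_map UDAF from the linked question is registered, and that my table is called test_sample):

select
  id,
  code,
  to_map(proc1, proc2) as map_aggregation  -- to_map is a registered UDAF, not a Hive built-in
from test_sample
group by id, code;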

But I don't know how to get from this step to the pivoted/transposed table I want. Any help on how to proceed would be great! Thanks.

– Sunny

9 Answers


Here is the approach I used to solve this problem using Hive's built-in map UDF:

select
    b.id,
    b.code,
    concat_ws('',b.p) as p,  -- flatten the single-element array to a string
    concat_ws('',b.q) as q,
    concat_ws('',b.r) as r,
    concat_ws('',b.t) as t
from 
    (
        select id, code,
        -- collect the (at most one) Proc2 value stored under each Proc1 key
        collect_list(a.group_map['p']) as p,
        collect_list(a.group_map['q']) as q,
        collect_list(a.group_map['r']) as r,
        collect_list(a.group_map['t']) as t
        from (
            -- build a single-entry map {Proc1 -> Proc2} for every row
            select
              id,
              code,
              map(proc1,proc2) as group_map 
            from 
              test_sample
        ) a
        group by
            a.id,
            a.code
    ) b;

"concat_ws" and "map" are hive udf and "collect_list" is a hive udaf.

– Shakti Garg
  • Can you explain how you would do it when the value is numeric? I saw your blog, but the code doesn't match up with the table. http://hadoopmania.blogspot.com/2015/12/transposepivot-table-in-hive.html – ML_Passion Mar 01 '17 at 16:35
  • Does this example generalize to cases where you have more than one column to pivot on? – Sledge Apr 21 '17 at 16:40
  • Does this generalize to cases where you might have values besides 'p','q','r','t'? – Michael Tuchman Nov 05 '19 at 20:22

Here is the solution I ended up using:

add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

select
    id,
    code,
    group_map['p'] as p,
    group_map['q'] as q,
    group_map['r'] as r,
    group_map['t'] as t
from (
    select id, code,
           collect(proc1, proc2) as group_map
    from test_sample
    group by id, code
) gm;

The collect UDAF used here comes from the Brickhouse repo: https://github.com/klout/brickhouse (an earlier version of this answer used to_map; see the comments below).
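One difference from the map/collect_list answer above: missing Proc1 values come back as NULL here rather than as empty strings. If blanks are needed, a sketch using the built-in coalesce:

select
    id,
    code,
    coalesce(group_map['p'], '') as p,  -- '' instead of NULL for absent keys
    coalesce(group_map['q'], '') as q,
    coalesce(group_map['r'], '') as r,
    coalesce(group_map['t'], '') as t
from (
    select id, code,
           collect(proc1, proc2) as group_map
    from test_sample
    group by id, code
) gm;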

– Sunny
  • I dont see to_map UDF in brickhouse repo. Could you please some more details around this ? – user3056186 Jul 01 '14 at 00:40
  • here is what i was looking at https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql – user3056186 Jul 01 '14 at 03:20
  • You can use the collect UDAF, which is similar to to_map: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java. You should replace to_map with collect; I've updated the solution accordingly. – Sunny Jul 01 '14 at 17:20
  • Hi! I am trying to do something similar. In your answer, you have group_map['p'] etc., indicating that you know the values ahead of time. How did you solve for not knowing what the values in Proc1 were? Please share. Thanks! – CodingInCircles Oct 08 '15 at 01:13

Yet another solution.

Pivot using Hivemall's to_map function.
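If Hivemall is not already loaded, to_map can be registered first (a sketch; the jar path is illustrative, and the class name is taken from the Hivemall sources):

-- adjust the jar path to your installation
add jar hivemall-core-with-dependencies.jar;
CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap';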

SELECT
  uid,
  kv['c1'] AS c1,
  kv['c2'] AS c2,
  kv['c3'] AS c3
FROM (
  SELECT uid, to_map(key, value) kv
  FROM vtable
  GROUP BY uid
) t

uid  c1  c2  c3
101  11  12  13
102  21  22  23

Unpivot

SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
  'c1', c1,
  'c2', c2,
  'c3', c3
)) t2 as key, value

uid  key  value
101  c1   11
101  c2   12
101  c3   13
102  c1   21
102  c2   22
102  c3   23

– myui

I have not written this code, but I think you can use some of the UDFs provided by Klout's Brickhouse: https://github.com/klout/brickhouse

Specifically, you could use their collect, as described here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/

and then explode the arrays (they will be of differing lengths) using the methods detailed in this post: http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra
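As a rough sketch of that idea using the built-in posexplode instead of Brickhouse's numeric_range (table and column names follow the question):

-- collect parallel arrays per (id, code), then explode them together;
-- posexplode exposes the array index, so proc1 and proc2 stay aligned
select id, code, proc1_arr[pos] as proc1, proc2
from (
    select id, code,
           collect_list(proc1) as proc1_arr,
           collect_list(proc2) as proc2_arr
    from test_sample
    group by id, code
) t
lateral view posexplode(proc2_arr) x as pos, proc2;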

– user2726995
  • Thanks for the note. I don't need the collect UDAF, as it's the same as the map-aggregation UDAF I'm already using here. I could do the same by using the key names in my map aggregation as new columns. The real issue is that I want it to be dynamic, i.e. I do not know how many different Proc1 values I might end up with, and I want to dynamically create more columns for each new Proc1. – Sunny Apr 14 '14 at 19:56
1. I created a dummy table called hive using the query below:

create table hive (id Int, Code String, Proc1 String, Proc2 String);

2. Loaded all the data into the table:

insert into hive values(1,'A','p','e');
insert into hive values(2,'B','q','f');
insert into hive values(3,'B','p','f');
insert into hive values(3,'B','q','h');
insert into hive values(3,'B','r','j');
insert into hive values(3,'C','t','k');

3. Now use the query below to achieve the output:
select id, code,
       case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
       case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
       case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
       case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from (
    select id, code,
           case when proc1 = 'p' then proc2 end as p,
           case when proc1 = 'q' then proc2 end as q,
           case when proc1 = 'r' then proc2 end as r,
           case when proc1 = 't' then proc2 end as t
    from hive
) dummy
group by id, code;
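Since (ID, Code, Proc1) is unique, the inner and outer queries can also be collapsed into a single aggregation; a minimal equivalent sketch using max, which simply picks the lone non-NULL value per group:

select id, code,
       coalesce(max(case when proc1 = 'p' then proc2 end), '') as p,
       coalesce(max(case when proc1 = 'q' then proc2 end), '') as q,
       coalesce(max(case when proc1 = 'r' then proc2 end), '') as r,
       coalesce(max(case when proc1 = 't' then proc2 end), '') as t
from hive
group by id, code;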
– Shivam

For numeric values, you can use the Hive query below:

Sample data

ID  cust_freq   Var1    Var2    frequency
220444  1   16443   87128   72.10140547
312554  6   984 7339    0.342452643
220444  3   6201    87128   9.258396518
220444  6   47779   87128   2.831972441
312554  1   6055    7339    82.15209213
312554  3   12868   7339    4.478333954
220444  2   6705    87128   15.80822558
312554  2   37432   7339    13.02712127

select id,
       sum(a.group_map[1]) as One,
       sum(a.group_map[2]) as Two,
       sum(a.group_map[3]) as Three,
       sum(a.group_map[6]) as Six
from (
    select id,
           map(cust_freq, frequency) as group_map
    from table
) a
group by a.id
having id in ('220444', '312554');

ID  one two three   six
220444  72.10140547 15.80822558 9.258396518 2.831972441
312554  82.15209213 13.02712127 4.478333954 0.342452643

In the above example I haven't used any custom UDF; it only uses built-in Hive functions.
Note: for a string key, write the value as sum(a.group_map['1']) as One.
– Suman

Below is also a way to unpivot, turning the FY2018 month columns into (Promo_date, Price) rows:

SELECT TM1_Code, Product, Size, State_code, Description
  , Promo_date
  , Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
   , MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
        ,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
        ,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
– Rahul Devanavar

To pivot rows back into columns (here, 'Cost' and 'Sales' rows into side-by-side Cost and Price columns), we can simply use the logic below.

SELECT Cost.Code, Cost.Product, Cost.Size,
       Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
    (SELECT Code, Product, Size, State_code, Promo_date, Price AS Cost
     FROM Product
     WHERE Description = 'Cost') Cost
JOIN
    (SELECT Code, Product, Size, State_code, Promo_date, Price AS Price
     FROM Product
     WHERE Description = 'Sales') Sales
ON (Cost.Code = Sales.Code
    AND Cost.Promo_date = Sales.Promo_date);

You can use CASE statements with some help from collect_set to achieve this. A detailed answer is at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/.

Here is the query for reference:

SELECT resource_id,
       CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
       CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
       CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
       CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
    SELECT resource_id,
           CASE WHEN quarter = 'Q1' THEN amount END AS quarter_1,
           CASE WHEN quarter = 'Q2' THEN amount END AS quarter_2,
           CASE WHEN quarter = 'Q3' THEN amount END AS quarter_3,
           CASE WHEN quarter = 'Q4' THEN amount END AS quarter_4
    FROM billing_info
) tbl1
GROUP BY resource_id;
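For illustration, with hypothetical billing_info rows like these (made-up data, not from the linked post):

resource_id  quarter  amount
r1           Q1       100
r1           Q3       300

the query should return one row per resource_id, with zeros for the missing quarters:

resource_id  quarter_1_spends  quarter_2_spends  quarter_3_spends  quarter_4_spends
r1           100               0                 300               0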
– Mahesh Mogal