1

I have a big database with 1.7 million rows. One column is a list generated by a collect_set. I would like to explode this list into a 1/0 boolean table

PIVOT is not supported by Hive at the moment, so no answer using this function can be accepted.

Table I have:

id   | list_center                              |
-----|------------------------------------------|
0788 | []                                       |
0568 | ["Lorem"]                                |
0879 | ["Lorem","ipsum"]                        |
0025 | ["who", "exercise", "train"]             |
0365 | ["ipsum", "airplane", "tariff", "lorem"] |

Expected result:

id   | lorem | ipsum  | who | exercise | train | airplane | tariff |
-----|-------|--------|-----|----------|-------|----------|--------|
0788 |   0   |   0    |  0  |    0     |   0   |    0     |    0   |
0568 |   1   |   0    |  0  |    0     |   0   |    0     |    0   |
0879 |   1   |   1    |  0  |    0     |   0   |    0     |    0   |
0025 |   0   |   0    |  1  |    1     |   1   |    0     |    0   |
0365 |   1   |   1    |  0  |    0     |   0   |    1     |    1   |

1 Answers1

1

I'm not sure I can answer, but I'll try to shed some light. I have recreated the input table and tried to process it using Spark SQL instead of HiveQL. The syntax in the SQL family is similar, so I hope you'll find useful thoughts.

Basically, I had to "redo" your collect_set result (using explode). So probably you will need to do the pivoting using the dataset before the collect_set transformation.

This would not create a row for "id"=0788, but it's shorter.

SELECT *
FROM (SELECT id, explode(list_center) list_center FROM Table)
PIVOT (
    count(1)
    FOR list_center IN ('lorem', 'ipsum', 'who', 'exercise', 'train', 'airplane', 'tariff')
)
+----+-----+-----+----+--------+-----+--------+------+
|  id|lorem|ipsum| who|exercise|train|airplane|tariff|
+----+-----+-----+----+--------+-----+--------+------+
|0365|    1|    1|null|    null| null|       1|     1|
|0568| null| null|null|    null| null|    null|  null|
|0879| null|    1|null|    null| null|    null|  null|
|0025| null| null|   1|       1|    1|    null|  null|
+----+-----+-----+----+--------+-----+--------+------+

In order to have the missing row I think you would need a cross join.

WITH exploded AS (SELECT id, explode(list_center) list_center, 1 cnt FROM Table)
SELECT *
FROM (SELECT id from Table)
CROSS JOIN (SELECT DISTINCT list_center FROM exploded)
FULL JOIN exploded
USING (id, list_center)
PIVOT (
    coalesce(first(cnt), 0)
    FOR list_center IN ('lorem', 'ipsum', 'who', 'exercise', 'train', 'airplane', 'tariff')
)
+----+-----+-----+---+--------+-----+--------+------+
|  id|lorem|ipsum|who|exercise|train|airplane|tariff|
+----+-----+-----+---+--------+-----+--------+------+
|0365|    1|    1|  0|       0|    0|       1|     1|
|0788|    0|    0|  0|       0|    0|       0|     0|
|0568|    0|    0|  0|       0|    0|       0|     0|
|0879|    0|    1|  0|       0|    0|       0|     0|
|0025|    0|    0|  1|       1|    1|       0|     0|
+----+-----+-----+---+--------+-----+--------+------+

In Oracle, when using pivot, we don't necessarily need to provide all the values, we could simply do FOR list_center IN (). But in Spark SQL it's not possible. Hopefully, HiveQL is flexible on this point.

ZygD
  • 22,092
  • 39
  • 79
  • 102
  • Thank you for your detailled response. Unfortunately Hive doesn't support pivot operations : [SO question](https://stackoverflow.com/questions/18624991/is-there-a-way-to-transpose-data-in-hive), [Hive issue open "support PIVOT"](https://issues.apache.org/jira/browse/HIVE-3776). Moreover, your method forces me to write all column names (first code block, line 5 `FOR list_center IN ...` – Benson_YoureFired Oct 14 '22 at 08:20