3

I have some records where every row belongs to some categories (data type - array of string) and a separate list of unique category(data type - string). I need to match every row with unique list and create flags for it.

Input:
------
ID   Category
1    ["Physics","Math"]
2    ["Math"]
3    ["Math,"Chemistry"]
4    ["Physics","Computer"]

Now I have separate list of unique list of category in excel in local like below:

Unique Category
["Physics"]
["Math"]
["Chemistry"]
["Computer"]

Final Output should look like this:

ID   Category                  Math_F  Physics_F  Computer_F  Chemistry_F
1    ["Physics","Math"]          1         1          0           0
2    ["Math"]                    1         0          0           0
3    ["Math,"Chemistry"]         1         0          0           1
4    ["Physics","Computer"]      0         1          1           0

Can someone please help with query, steps and explanation. I am new to Hive.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
codekiller
  • 53
  • 1
  • 9

1 Answers1

2

Using array_contains():

SELECT ID,
       Category,
       CASE
           WHEN array_contains(Category, 'Math') THEN 1
           ELSE 0
       END Math_F,
       CASE
           WHEN array_contains(Category, 'Physics') THEN 1
           ELSE 0
       END Physics_F,
       CASE
           WHEN array_contains(Category, 'Computer') THEN 1
           ELSE 0
       END Computer_F,
       CASE
           WHEN array_contains(Category, 'Chemistry') THEN 1
           ELSE 0
       END Chemistry_F
FROM TABLE t;

And if you want columns to be constructed dynamically using your array of unique categories, then use some other tool for building query. For example it can be done using shell script.

See this example of building SQL based on pre-defined array. You can easily add your array reading from the file:

#!/bin/bash

#define array
array=( Physics Math Computer Chemistry )

#initial sql
sql="select ID,   
       Category,"

#get length of array
arraylength=${#array[@]}

#get first flag column
columns="CASE
  WHEN array_contains(Category,'${array[0]}') THEN 1 
  ELSE 0 
          END ${array[0]}_F"

#attach all other flags:
for (( i=1; i<=$(( $arraylength-1 )); i++ ))
do 
    columns="$columns,
CASE
    WHEN array_contains(Category,'${array[$i]}') THEN 1
    ELSE 0 
 END ${array[$i]}_F"
done

#final SQL
sql="$sql 
$columns
from table t;
"
#print result
 echo "$sql"

Result:

 SELECT ID,
       Category,
       CASE
           WHEN array_contains(Category, 'Physics') THEN 1
           ELSE 0
       END Physics_F,
       CASE
           WHEN array_contains(Category, 'Math') THEN 1
           ELSE 0
       END Math_F,
       CASE
           WHEN array_contains(Category, 'Computer') THEN 1
           ELSE 0
       END Computer_F,
       CASE
           WHEN array_contains(Category, 'Chemistry') THEN 1
           ELSE 0
       END Chemistry_F
FROM TABLE t;

You can add Hive call to the above script: hive -e "$sql" to execute it, or save it to the file instead.

leftjoin
  • 36,950
  • 8
  • 57
  • 116