I would like to aggregate some columns into an array or JSON object in Redash. The table lives in a Presto database, and I also need to query it from PySpark via Hive. The table is large, and I need to keep the result as small as possible so that I can save the dataframe to S3 quickly and then read it back from S3 efficiently as Parquet.
I am not sure what the best data structure for this would be (a JSON object? an array of arrays?).
The original table (more than 10^9 rows; some columns, e.g. obj_desc, may contain more than 30 English words):
id | cat_name  | cat_desc      | obj_name | obj_desc    | obj_num
1  | furniture | living office | desk     | 4 corners   | 1.5
1  | furniture | living office | chair    | 4 legs      | 0.8
1  | furniture | restroom      | tub      | white wide  | 2.7
1  | cloth     | fashion       | T-shirt  | black large | 1.1
What I want (though this may not be the best data structure):
id | cat_item_aggregation
1  | [['furniture', ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']], ['restroom', ['tub', 'white wide', '2.7']]], ['cloth', ['fashion', ['T-shirt', 'black large', '1.1']]]]
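For what it is worth, the closest I can get on the PySpark side is a nested array of structs rather than a literal array of arrays. A minimal sketch, assuming the source rows are registered as a Hive table called items (the table name and intermediate column names are just placeholders):

```python
from pyspark.sql import functions as F

# Hypothetical Hive table holding the original rows.
df = spark.table("items")

# Build the nesting level by level: objects under each (id, cat_name,
# cat_desc), sub-categories under each (id, cat_name), categories under id.
nested = (
    df.groupBy("id", "cat_name", "cat_desc")
      .agg(F.collect_list(F.struct("obj_name", "obj_desc", "obj_num"))
            .alias("objects"))
      .groupBy("id", "cat_name")
      .agg(F.collect_list(F.struct("cat_desc", "objects"))
            .alias("sub_categories"))
      .groupBy("id")
      .agg(F.collect_list(F.struct("cat_name", "sub_categories"))
            .alias("cat_item_aggregation"))
)
```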
I have tried array_agg, following "Efficiently aggregate array columns as part of a group by" and "Postgres - aggregate two columns into one item",
and also json_build_object, following "Return as array of JSON objects in SQL (Postgres)" and "How to group multiple columns into a single array or similar?",
but neither works in Redash, presumably because it is querying Presto rather than PostgreSQL (json_build_object in particular is PostgreSQL-specific).
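In case it helps, the Spark SQL analogue of those attempts, which I could run from PySpark against the Hive table instead of from Redash, would look roughly like this; collect_list and named_struct stand in for array_agg and json_build_object, and items is again a placeholder name:

```python
# Spark SQL stand-in for the PostgreSQL attempts above: collect_list
# replaces array_agg, named_struct replaces json_build_object.
grouped = spark.sql("""
    SELECT id,
           collect_list(named_struct(
               'cat_name', cat_name,
               'cat_desc', cat_desc,
               'obj_name', obj_name,
               'obj_desc', obj_desc,
               'obj_num',  obj_num
           )) AS cat_item_aggregation
    FROM items
    GROUP BY id
""")
```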
Could anybody let me know what the best data structure is for this kind of table, and how to build it?
Is JSON better than an array of arrays here, since it is hard to pull individual elements back out of a nested array of arrays?
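My current thinking, continuing the sketch above: keep the column as a typed array of structs when writing Parquet (Parquet stores nested structs natively in columnar form, so they compress well and round-trip without re-parsing), and only serialize to a JSON string when a flat text column is actually needed. The bucket and path below are placeholders:

```python
# Writing the typed nested column: Parquet keeps the struct schema,
# so the nesting survives a round trip without any string parsing.
nested.write.mode("overwrite").parquet("s3://my-bucket/items_agg/")

# Only if a single text column is required (e.g. to show in Redash):
# serialize to JSON, at the cost of losing the typed columnar layout.
as_json = nested.withColumn(
    "cat_item_aggregation", F.to_json("cat_item_aggregation")
)
```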
Thanks.