1

I would like to aggregate some columns into an array or json object on redash. The data table is on presto database. I need to query it from pyspark hive. The data table is large and I need to keep its size as small as possible so that I can save the dataframe to s3 faster and then read it as parquet from s3 efficiently.

I am not sure what the best data structure should be for this? (json object? array of array ?)

The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):

id.    cat_name.   cat_desc.       obj_name.    obj_desc.   obj_num
1.     furniture    living office   desk         4 corners    1.5.      
1      furniture.   living office.  chair.       4 legs.      0.8
1.     furniture.   restroom.       tub.         white wide.  2.7
1.     cloth.       fashion.        T-shirt.     black large. 1.1

I want (this may not be the best data structure):

id.     cat_item_aggregation
 1.     [['furniture', ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']], ['restroom', [['tub', 'white wide', '2.7']], ['cloth', ['fashion', ['T-shirt', 'black', '1.1']]]]

I have tried array_agg from PostgreSQL: Efficiently aggregate array columns as part of a group by Postgres - aggregate two columns into one item

also json_build_object from Return as array of JSON objects in SQL (Postgres) How to group multiple columns into a single array or similar?

but they do not work in redash.

Could anybody let me know what the best data structure should be for this kind of table ? and how to build it ?

json may be better than array of array because it is hard to decompose the elements from array of array ?

thanks

user3448011
  • 1,469
  • 1
  • 17
  • 39
  • What do you mean by *they do not work*? Do you get an error? Id so the what is the exact and complete error message? Do they just not produce what you expect? If so then exactly what do they produce and exactly what do you expect? Further having 30 (or more) English words means nothing do you actually mean populated columns? Would be best if you posted the actual table definition (ddl). – Belayer Jan 22 '22 at 19:17
  • "the best data structure", if any, highly depends on the way you want to query this aggregate data. Can you give examples of typical queries/criteria to query this data ? Also, did you investigate the VIEW as a solution for your need ? – Edouard Jan 22 '22 at 20:38

0 Answers0