Say I have an analysis that spits out a wide-form pandas dataframe with a MultiIndex on both the index and the columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern for storing the outputs in a DataJoint table? The following come to mind, each with pros and cons:
- Reshape to long-form and store single entries, with the index and column levels together as primary keys.
  - Pros: preserves the ability to query/restrict on both index and column levels.
  - Cons: each analysis would insert millions of rows into the table, and I may run hundreds of such analyses. Inserting that many rows already takes several minutes per dataframe, and queries become slow.
- Keep wide-form and store each dataframe row as a longblob, with just the index levels as primary keys.
  - Pros: retains the ability to query on index levels, and keeps the table at a more reasonable number of rows.
  - Cons: loses the ability to query on column levels, and the column labels would have to be stored somewhere else to reconstruct the original dataframes. Since dataframes with different numbers of columns need to live in the same table, it is not feasible to explicitly encode all the columns in the table definition.
- Serialize the dataframe itself (e.g. to HDF5) and store it in the database simply as a filepath or an attachment.
  - Pros: keeps the database small, and is simple to implement.
  - Cons: does not really feel in the "spirit" of DataJoint; loses the ability to perform restrictions or queries on the contents.
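For concreteness, here is roughly what the second option (one longblob per dataframe row, plus column metadata stored separately) would look like. This is just my sketch, not an established DataJoint pattern; the helper names, and the idea of serializing the column MultiIndex as a dict of tuples, are assumptions:

```python
import numpy as np
import pandas as pd

def pack_wide_rows(df):
    """Split a wide-form dataframe into per-row records plus column metadata
    (sketch for storing one longblob per dataframe row in a DataJoint table)."""
    # Serialize the column MultiIndex once so the frame can be rebuilt later;
    # this would live in a companion table or a per-analysis metadata blob.
    col_meta = {
        "names": list(df.columns.names),
        "tuples": [c if isinstance(c, tuple) else (c,) for c in df.columns],
    }
    records = []
    for idx, values in zip(df.index, df.to_numpy()):
        idx = idx if isinstance(idx, tuple) else (idx,)
        rec = dict(zip(df.index.names, idx))  # index levels -> primary key attrs
        rec["values"] = values                # 1-D array -> longblob attribute
        records.append(rec)
    return records, col_meta

def unpack_wide_rows(records, col_meta, index_names):
    """Rebuild the original wide-form dataframe from fetched records + metadata."""
    index = pd.MultiIndex.from_tuples(
        [tuple(r[n] for n in index_names) for r in records], names=index_names
    )
    columns = pd.MultiIndex.from_tuples(col_meta["tuples"], names=col_meta["names"])
    data = np.vstack([r["values"] for r in records])
    return pd.DataFrame(data, index=index, columns=columns)
```

The insert side would then be a single bulk `Table.insert(records)`, with `col_meta` stored once per analysis; whether that metadata goes in a separate table or a blob is another design choice.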
Are there any designs or pros/cons I haven't thought of?
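For reference, the reshape step in the first option is cheap in pandas; it's the row-by-row insert volume that hurts. A minimal sketch of producing insert-ready records (the function name and the `value` column label are my own choices):

```python
import numpy as np
import pandas as pd

def to_long_records(df, value_name="value"):
    """Stack all column levels into the index so each (index levels x column
    levels) combination becomes one record, ready for a bulk insert call."""
    long = df.stack(list(range(df.columns.nlevels)))  # -> Series with full MultiIndex
    long.name = value_name
    # reset_index turns every index level into a column; each row then
    # becomes a dict suitable for something like `Table.insert(records)`
    return long.reset_index().to_dict("records")
```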