
Say I have some analysis that spits out a wide-form pandas dataframe with a multiindex on the index and columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern to use to store the outputs in a datajoint table? The following come to mind, each with pros and cons

  1. Reshape to long-form and store single entries with index x column levels as primary keys
  • Pros: Preserves the ability to query/constrain based on both index and columns
  • Cons: Each analysis would insert millions of rows into the table, and I may have to run hundreds of such analyses. Just inserting that many rows takes several minutes per dataframe, and queries become slow
  2. Keep as wide-form and store single rows as longblob with just index levels as primary keys
  • Pros: Retain ability to query based on index levels, results in tables with a more reasonable number of rows
  • Cons: Loses the ability to query based on column levels; the columns would also have to be stored somewhere to be able to reconstruct the original dataframes. Since dataframes with different numbers of columns need to share one table, explicitly encoding all the columns in the table definition is not feasible
  3. Save the dataframe itself as e.g. an HDF5 file and store it in the database simply as a filepath or as an attachment
  • Pros: Does not result in large databases, simple to implement
  • Cons: Does not really feel in the "spirit" of DataJoint; loses the ability to perform constraints or queries
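To make option 1 concrete, here is a minimal sketch of the reshape step, using a made-up wide-form frame (the `unit`/`trial`/`feature`/`bin` names and the final `Analysis.insert` call are hypothetical stand-ins for whatever tables you define):

```python
import numpy as np
import pandas as pd

# Hypothetical wide-form analysis output: (unit, trial) on the index,
# (feature, bin) on the columns -- the bin count varies with parameters.
index = pd.MultiIndex.from_product([[1, 2], [10, 20]], names=["unit", "trial"])
columns = pd.MultiIndex.from_product([["rate"], [0.0, 0.1, 0.2]],
                                     names=["feature", "bin"])
wide = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                    index=index, columns=columns)

# Option 1: melt to long form, one row per (unit, trial, feature, bin)
long = wide.stack(list(wide.columns.names)).rename("value").reset_index()

# Each record could then be inserted with all four levels in the primary key:
# Analysis.insert(long.to_dict("records"))
```

This is exactly the shape that makes the row count explode: a 1000 x 1000 index with 1000 bins yields a billion rows per analysis.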

Are there any designs or pros/cons I haven't thought of?

zagaroo
  • Can you share more about the specifics of the analysis? It would help me to think through the options with a hard example, and info on what analyses might look like downstream. There are perhaps middle grounds between these options, like storing multiindex columns as a single dict. Also, `AttributeAdapter` can help you define unique data types https://github.com/datajoint/datajoint-python/blob/master/datajoint/attribute_adapter.py – Chris Broz Jul 25 '22 at 17:21
  • @ChrisBroz thanks! This is actually a general pattern that applies to multiple analyses that all work like this, but broadly, the rows of the dataframe represent single units, trials, identifiers like that, while the columns are some kind of bin (temporal, etc.)--the number of columns and their identities depends on the bin resolution, which is specified as a parameter by the user. The entries are of float type. – zagaroo Jul 25 '22 at 17:35
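Along the lines of the comment above, a middle ground for option 2 is to serialize each wide-form frame into a plain dict of arrays, which a longblob can hold while preserving both index and column levels. This is a hedged sketch using only pandas; the two helper functions (names are my own) could then back a `dj.AttributeAdapter` with `attribute_type = "longblob"`, its `put`/`get` delegating to them:

```python
import pandas as pd

def frame_to_blob(df: pd.DataFrame) -> dict:
    """Pack a wide-form DataFrame so index/column levels survive storage."""
    return {
        "values": df.to_numpy(),
        "index": df.index.to_frame(index=False).to_dict("list"),
        "columns": df.columns.to_frame(index=False).to_dict("list"),
    }

def blob_to_frame(blob: dict) -> pd.DataFrame:
    """Rebuild the original wide-form DataFrame from the packed dict."""
    return pd.DataFrame(
        blob["values"],
        index=pd.MultiIndex.from_frame(pd.DataFrame(blob["index"])),
        columns=pd.MultiIndex.from_frame(pd.DataFrame(blob["columns"])),
    )
```

The round trip preserves level names, so differently-binned analyses can share one table without encoding columns in the table definition.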

1 Answer


Before providing a more specific answer, let's establish a few basics (also known as normal forms).

DataJoint implements the relational data model. Under the relational model, complex dataframes of the kind you describe must be normalized into multiple tables, related to each other through primary keys and foreign keys.

Each table will represent a single entity class: Units and Trials will be represented in separate tables.

All entities in a given table will have the same attributes (columns). They will be uniquely identified by the same attribute(s) comprising the primary key.

In addition to the primary key, tables may have additional secondary indexes to accelerate queries.

If you already know about normalization, we can talk about how to normalize your design. If not, we can refer you to a quick tutorial.
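For illustration, a normalized design along these lines might look like the following sketch. All table and attribute names here are hypothetical, a configured database connection is assumed, and in practice the analysis table would likely be a `dj.Computed` class with a `make` method rather than the `dj.Manual` shown:

```python
import datajoint as dj

schema = dj.Schema("my_analysis")  # assumes dj.config points at a live database

@schema
class Unit(dj.Manual):
    definition = """
    unit_id : int
    """

@schema
class Trial(dj.Manual):
    definition = """
    trial_id : int
    """

@schema
class AnalysisParams(dj.Lookup):
    definition = """
    params_id : int
    ---
    bin_resolution : float
    """

@schema
class Analysis(dj.Manual):
    definition = """
    -> Unit
    -> Trial
    -> AnalysisParams
    bin : decimal(9, 4)   # one column level of the wide frame, now a key attribute
    ---
    value : float
    """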

  • Hi Dimitri, thanks! Sure, I'm familiar with this--I have glossed over these details, but there will be separate Unit and Trial (etc.) tables which will be used as foreign keys for the analysis table. For boring reasons it is more performant for me to generate the dataframe all at once and then massage it into the RDM, but I can handle that. – zagaroo Jul 26 '22 at 00:56
  • My question is more of a design question: say I have a set of functions already to generate dataframes like this, for a single experiment they may be 1000 Units x 1000 Trials on the index and somewhere between 100-1000 columns (which depends on the analysis parameters, which I have in a dj.Lookup table). I also need to be able to efficiently convert between a datajoint table, constrained a certain way, and wideform pd.DataFrame. Would you say the right design is #1 above? In my hands this seems to be computationally prohibitive – zagaroo Jul 26 '22 at 01:01
  • @zagaroo I would like to understand the experiment and analysis structure better. If you like you can schedule an office hour and we can go through the design. These are recorded and posted on youtube for others to learn. You can schedule here: https://forms.gle/HjxmvLfBc6TuTSKd6 – Dimitri Yatsenko Jul 27 '22 at 14:17
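On the conversion question raised in the comments: going from a constrained long-form fetch back to a wide-form frame is a single pivot. A small self-contained sketch, where the hand-built `long` frame stands in for something like `pd.DataFrame((Analysis & restriction).fetch())`:

```python
import numpy as np
import pandas as pd

# Stand-in for a constrained long-form fetch from the database
long = pd.DataFrame({
    "unit":  [1, 1, 1, 1, 2, 2, 2, 2],
    "trial": [10, 10, 20, 20, 10, 10, 20, 20],
    "bin":   [0.0, 0.1] * 4,
    "value": np.arange(8.0),
})

# Rebuild the wide-form frame: (unit, trial) on the index, bins as columns
wide = long.pivot(index=["unit", "trial"], columns="bin", values="value")
```

This direction is cheap; the expensive part of design #1 remains the initial insert of the long-form rows.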