
I'm just getting into Pandas, and want to figure out a good way of holding time-varying data corresponding to multiple trials.

A concrete example might be:

Trial 1: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 2: Salinity = 0.1 (unchanging), pH (at time 1, 2, ...)
Trial 3: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)
Trial 4: Salinity = 0.2 (unchanging), pH (at time 1, 2, ...)

Note that experiments can be repeated multiple times with the same initial parameters (the salinity), but with different time-varying variables (pH).

A DataFrame is 2-dimensional, so I would have to create a DataFrame for each trial. Is this the best way to go about it, and how would I combine them (e.g., to get the average pH over time for trials with the same initial setup)?

Teknophilia

1 Answer


You can hold the data for all trials in a single pd.DataFrame, with one row per trial and time point. Below is an example.

import pandas as pd

df = pd.DataFrame({'Trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
                   'Date': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                   'Salinity': [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
                                0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
                   'pH': [2, 4, 1, 4, 6, 8, 3, 2, 9, 3, 1, 4, 6, 11, 4, 6]})

df = df.set_index(['Trial', 'Date', 'Salinity'])

#                      pH
# Trial Date Salinity    
# 1     1    0.1        2
#       2    0.1        4
#       3    0.1        1
#       4    0.1        4
# 2     1    0.1        6
#       2    0.1        8
#       3    0.1        3
#       4    0.1        2
# 3     1    0.2        9
#       2    0.2        3
#       3    0.2        1
#       4    0.2        4
# 4     1    0.2        6
#       2    0.2       11
#       3    0.2        4
#       4    0.2        6

Explanation

  • When constructing the dataframe, include an identifier column; here, Trial holds an integer identifier for each trial.
  • Setting the index to ['Trial', 'Date', 'Salinity'] gives pandas a natural MultiIndex to use for grouping, indexing and slicing.
  • For example, df.loc[(1, 2, 0.1)] returns a pd.Series for that row, showing pH = 4.
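
To address the second part of the question (the average pH over time for trials with the same initial setup), one possible approach is to group on the index levels. This is a minimal sketch that rebuilds the example dataframe above; the names mean_ph and mean_ph_wide are only illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so this snippet runs on its own.
df = pd.DataFrame({
    'Trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'Date': [1, 2, 3, 4] * 4,
    'Salinity': [0.1] * 8 + [0.2] * 8,
    'pH': [2, 4, 1, 4, 6, 8, 3, 2, 9, 3, 1, 4, 6, 11, 4, 6],
}).set_index(['Trial', 'Date', 'Salinity'])

# Average pH at each time point across trials that share the same salinity.
mean_ph = df.groupby(level=['Salinity', 'Date'])['pH'].mean()

# Optional reshaping: one row per time point, one column per salinity.
mean_ph_wide = mean_ph.unstack('Salinity')
print(mean_ph_wide)
```

With the sample data, the mean at Date 1 is 4.0 for salinity 0.1 (trials 1 and 2) and 7.5 for salinity 0.2 (trials 3 and 4).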
jpp
  • jpp: can you say a bit about what you've done with the indexing? – anon01 Mar 17 '18 at 03:26
  • I guess my question was: it is somewhat uncommon to index three columns. What is the advantage to this, in contrast to a single `uid` index? Also, can you say anything about the underlying datastructure of multiple indices/time complexity? I haven't looked closely at indexing. – anon01 Mar 17 '18 at 03:37
  • I see, the advantage is efficiency. Indexing is dictionary-like: if the index is unique, lookups are O(1); see [this answer](https://stackoverflow.com/questions/16626058/what-is-the-performance-impact-of-non-unique-indexes-in-pandas) for details. It's important to let *how* you intend to use your data influence the structure / index of your dataframe. – jpp Mar 17 '18 at 03:42