
My dataset consists of instances of series data, each with associated metadata. It is similar to a CD collection, where each track has metadata (artist, album, length, etc.) and a series of audio data. Or imagine a road condition survey dataset: each time a survey is conducted, the metadata (location, date, time, operator, etc.) is recorded, along with physical series data describing the road condition for each unit length of road. The collection of surveys ({metadata, data} pairs) is the dataset.

I'd like to take advantage of pandas to help import, store, search and analyse that dataset. pandas does not have built-in support for this type of dataset, but many have tried to add it.

The typical solutions are either:

  1. Adding metadata to a pandas DataFrame, but this is the wrong way around: I want a collection of metadata records, each with associated data, not data with associated metadata.

  2. Casting the data to a valid DataFrame field type and storing it as one of the metadata fields, but the casting process discards significant integrity.

  3. Using multiple indices to create a 3D DataFrame, but this imposes design details on your choice of index, which limits experimentation.

This sort of dataset is very common, and I see a lot of people trying to bend pandas to accommodate it. I wonder what the right approach is, or even if pandas is the right tool.

Heath Raftery

1 Answer


I now have a working solution, but since I haven't seen this method documented I wonder if there be dragons ahead.

My "database" is a pandas DataFrame that looks something like this:

|   | Description      | Time                | Length | data_uuid      |
| 0 | My first record  | 2017-03-09 11:00:00 | 502    | f7ee-11e6-b702 |
| 1 | My second record | 2017-03-10 11:00:00 | 551    | f7ee-11e6-a996 |

That is, my metadata records are rows of a DataFrame, which gives me all the power of pandas, while each record's data is assigned a UUID on import. The data for each metadata record is actually a separate DataFrame, serialised to a file whose name is that UUID.
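To make the import step concrete, here is a minimal sketch of how a record might be added. The helper name `add_record`, the `data_store` directory and the use of `uuid.uuid4()` are illustrative assumptions, not necessarily what my real import code does:

```python
import uuid
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data_store")  # hypothetical location for the per-record files


def add_record(df_database, metadata, df_data, data_dir=DATA_DIR):
    """Pickle df_data under a fresh UUID and return the database with a new metadata row."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # The UUID doubles as the file name, so the metadata row can find its data later.
    data_uuid = str(uuid.uuid4())
    df_data.to_pickle(data_dir / data_uuid)

    row = pd.DataFrame([{**metadata, "data_uuid": data_uuid}])
    if df_database is None or df_database.empty:
        return row
    return pd.concat([df_database, row], ignore_index=True)
```

Returning a new DataFrame rather than modifying the database in place keeps the helper easy to test; the caller simply reassigns, e.g. `df_database = add_record(df_database, {...}, df_data)`.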

That way, an illustrative example of looking up a record and pulling out the data looks like this:

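import os
import pandas as pd

# Select the records of interest once, then load the data for the first match.
matches = df_database[df_database['Length'] >= 550.0]
display(matches)
idx = matches.index[0]
df_data = pd.read_pickle(os.path.join(DATA_DIR, str(df_database.at[idx, 'data_uuid'])))
display(df_data)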

With suitable importation, storage and lookup functions, this seems to give me the power (with associated cumbersomeness!) of pandas without pulling too many restrictive tricks.
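As a rough idea of what such a lookup function could look like, here is a small sketch. The name `load_data` and its signature are hypothetical; it assumes the `data_uuid` column and one-pickle-per-record layout described above:

```python
from pathlib import Path

import pandas as pd


def load_data(df_database, mask, data_dir):
    """Yield (index, metadata row, data DataFrame) for each record selected by mask."""
    for idx, row in df_database[mask].iterrows():
        # Each record's data lives in a pickle file named after its UUID.
        yield idx, row, pd.read_pickle(Path(data_dir) / str(row["data_uuid"]))
```

It would be called with an ordinary boolean mask, e.g. `for idx, meta, df_data in load_data(df_database, df_database['Length'] >= 550.0, DATA_DIR): ...`, so the search itself stays plain pandas and the data is only read from disk for the records that match.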

Heath Raftery