148

Is it possible to add some meta-information/metadata to a pandas DataFrame?

For example, the instrument's name used to measure the data, the instrument responsible, etc.

One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!

Andy Hayden
P3trus
    Please note the @ryanjdillon answer (currently buried near the bottom) which mentions the updated experimental attribute 'attrs' which seems like a start, maybe – JohnE Aug 14 '20 at 17:37
    You can register custom accessors: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-register-accessors – SiP Feb 24 '22 at 14:57

13 Answers

108

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join, assign or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
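To make that caveat concrete, here is a minimal sketch (reusing the `instrument_name` attribute from the example above) showing a plain attribute being dropped by an operation that returns a new DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.instrument_name = 'Binky'

# Filtering returns a brand-new DataFrame; the custom attribute is not
# carried over, because pandas only propagates names listed in _metadata
# (plus .attrs) through __finalize__.
df2 = df[df['a'] > 1]
print(hasattr(df2, 'instrument_name'))  # False
```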

Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.

zabop
unutbu
    +1 for your choice of instrument name! Do you have any experience trying to dump these extra attributes into HDFStore? – Dan Allan Apr 04 '13 at 14:40
    @DanAllan: If `store = pd.HDFStore(...)`, then attributes can be stored with `store.root._v_attrs.key = value`. – unutbu Apr 04 '13 at 16:44
    To anyone else who might use this: the docs have added a section on this. http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore – Dan Allan Apr 11 '13 at 18:50
    The [cookbook](http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore) or [this answer](http://stackoverflow.com/questions/17820071/storing-pandas-objects-along-with-regular-python-objects-in-hdf5) do not explain how to automatically add all attributes that you added to the `DataFrame` to the `HDFStore`, though. – j08lue Oct 08 '14 at 14:34
    [For posterity, this will not be maintained during pickling in v0.18.1](http://stackoverflow.com/questions/31727333/get-the-name-of-the-dataframe-python/31727504#comment70185470_31727504). – tmthydvnprt Jan 05 '17 at 16:43
    In pandas 0.23.1, creating a new attribute by assigning a dictionary, list, or tuple gives a warning (i.e. `df = pd.DataFrame(); df.meta = {}` produces `UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access`). (No warning is given if the attribute has already been created as in `df = pd.DataFrame(); df.meta = ''; df.meta = {}`). – teichert Jun 26 '18 at 19:19
  • updated cookbook link: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#hdfstore-where-string-comparison – charlesreid1 Jul 25 '18 at 07:21
    Most surprisingly, `df.copy()` does not retain custom attributes either. – Joooeey Sep 06 '18 at 22:42
    There is currently an experimental attribute `.attrs` which is supposed to do this: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#metadata – Jeremiah England Mar 19 '20 at 16:38
72

As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future. For example:

import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'

Find it in the docs here.

Trying this out with to_parquet and then read_parquet, it doesn't seem to persist, so be sure to check this against your use case.
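For what it's worth, a quick sketch of the behavior noted in the comments below: `attrs` does survive `copy` (and pickle), though not every operation, e.g. groupby:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.attrs['instrument_name'] = 'Binky'

# attrs are propagated through __finalize__, so copy() keeps them:
df2 = df.copy()
print(df2.attrs)  # {'instrument_name': 'Binky'}
```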

stefanbschneider
ryanjdillon
  • This is interesting and does seem to persist for copy/loc/iloc, but not for groupby. – JohnE Aug 14 '20 at 17:35
    Just a suggestion, but maybe show an example of how to use it? The documentation is basically nothing, but just from playing around with it I can see that it is initialized as an empty dictionary and it seems to be set up so that it has to be a dictionary although of course one could nest a list inside it, for example. – JohnE Aug 14 '20 at 17:43
    You may find this [Stackoverflow discussion](https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with-pyarrow) useful as it demonstrates how to add custom metadata to parquet files if required – rdmolony Aug 29 '20 at 09:17
    @rdmolony That's great. I think using a `dataclass` for the metadata and then subclassing `DataFrame` to have a method doing the load/dumping as in the post you shared could be a nice solution. – ryanjdillon Aug 30 '20 at 08:53
    This is nice. In contrast to the accepted answer, this does preserve attributes after saving and loading from pickle! – stefanbschneider Oct 21 '20 at 09:27
    It is not persistent when stored via `to_feather()`. – buhtz Jun 08 '21 at 11:55
14

Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).
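A short sketch of the idea (the attribute name `source` is made up for illustration; note that `_metadata` is a class-level list, as the comments below point out, so registering a name there affects every DataFrame in the process):

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]})
df.source = 'instrument A'

# _metadata is a class attribute; registering the name here asks pandas
# to carry it through operations that call __finalize__, such as copy():
pd.DataFrame._metadata = pd.DataFrame._metadata + ['source']

df2 = df.copy()
print(df2.source)  # instrument A
```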

follyroof
    `_metadata` is not part of the public API, so I would strongly recommend against relying on this functionality. – shoyer Jan 20 '15 at 20:30
  • @Stephan can you elaborate on that please? Why is it important to be a part of the public API? Is your statement also true for version 0.15? – TomCho Nov 06 '15 at 13:06
  • @Stephan Sorry, I found an answer of yours elaborating on this: http://stackoverflow.com/a/28054711. But is that still true today? Are there no better alternatives than to build a wrapper? – TomCho Nov 06 '15 at 14:10
    @TomCho yes, that answer is still true today. You might take a look at xray (http://github.com/xray/xray) for one alternative example of a labeled array that supports metadata, especially if you have multi-dimensional data (`.attrs` is part of the xray API) – shoyer Nov 09 '15 at 06:23
    `_metadata` is actually a class attribute, not an instance attribute. So new `DataFrame` instances inherit from previous ones, as long as the module stays loaded. Do not use `_metadata` for anything. +1 for `xarray`! – j08lue Dec 22 '16 at 10:53
    _metadata -- an unsupported feature that saved my day! Thank you. – joctee Dec 12 '18 at 10:35
  • `_metadata` _is_ part of the documented API (despite the underscore), and has been for at least five years, see pandas commit 134f1775. https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties While it's true that `_metadata` is a class attribute, it is used to enable instance attributes. – nemetroid Jan 06 '21 at 14:29
13

Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your dataframe, then the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485

There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
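A minimal sketch of that wrapper idea (the class name `MetaFrame` and its interface are made up for illustration):

```python
import pandas as pd

class MetaFrame:
    """Hypothetical wrapper pairing a DataFrame with a metadata dict."""
    def __init__(self, df, **meta):
        self.df = df
        self.meta = meta

mf = MetaFrame(pd.DataFrame({'a': [1, 2]}), instrument_name='Binky')
mf.df = mf.df[mf.df['a'] > 1]      # transform the wrapped frame freely;
print(mf.meta['instrument_name'])  # the metadata is untouched: Binky
```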

Matti John
11

The top answer of attaching arbitrary attributes to the DataFrame object is good, but if you use a dictionary, list, or tuple, it will emit the warning "Pandas doesn't allow columns to be created via a new attribute name". The following solution works for storing arbitrary attributes.

from types import SimpleNamespace

import pandas as pd

df = pd.DataFrame()
df.meta = SimpleNamespace()
df.meta.foo = [1, 2, 3]
bscan
  • Also, if you want this to persist across copies of your dataframe, you need to do `pd.DataFrame._metadata += ["meta"]`. Note that this is an attribute of the Pandas class, not an attribute of your specific dataframe – bscan Feb 19 '19 at 23:43
  • This approach won't work anymore as `df.meta` triggers a warning that Pandas does not allow new columns to be generated this way. – anishtain4 Sep 10 '19 at 15:01
  • @anishtain4, I just tested it with Pandas 25.1 (released ~2 weeks ago) and this code still works for me. That warning is not triggered since `df.meta` is a SimpleNamespace. Pandas will not try and build a column from it. – bscan Sep 10 '19 at 16:34
7

As mentioned in other answers and comments, _metadata is not part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it for research prototyping and replace it if it stops working. Right now it works with groupby/apply, which is helpful. This is an example (which I couldn't find in other answers):

import pandas as pd

df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)

Output:

val
1    my_value
2    my_value
3    my_value
dtype: object
Dennis Golomazov
7

As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results between several dataframes.

In my work, we are often comparing the results of several firmware revisions and different test scenarios, and adding this information is as simple as this:

import pandas as pd
import xarray as xr

df = pd.read_csv(meaningless_test)  # meaningless_test: path to your CSV
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}  # your own values
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata
jtwilson
4

Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.

It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:

import h5io

save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')  # my_df: your DataFrame
h5io.write_hdf5('path/to/file.hdf5', save_dict)
in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
# etc...

Another option would be to look into a project like xray (since renamed xarray), which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.

choldgraf
4

I have been looking for a solution and found that pandas DataFrames have an attrs property:

import pandas as pd

df = pd.DataFrame()
df.attrs.update({'your_attribute': 'value'})
df.attrs['your_attribute']

This dictionary sticks to your frame when you pass it around, though see the comments below about operations that drop it.

DisplayName
    Note that attrs is experimental and may change without warning, but this is a very simple solution. I wonder if attrs transfers to new dataframes. – Liquidgenius Jul 15 '20 at 20:51
    Unfortunately, attrs aren't copied to new dataframes :( – Adam Jul 28 '20 at 19:31
4

Referring to the section Define original properties of the official pandas documentation, and if subclassing pandas.DataFrame is an option, note that:

To let original data structures have additional properties, you should let pandas know what properties are added.

Thus, something you can do - where the name MetaedDataFrame is arbitrarily chosen - is

class MetaedDataFrame(pd.DataFrame):
    """s/e."""
    _metadata = ['instrument_name']

    @property
    def _constructor(self):
        return self.__class__

    # Define the following if providing attribute(s) at instantiation
    # is a requirement, otherwise, if YAGNI, don't.
    def __init__(
        self, *args, instrument_name: str = None, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name

And then instantiate your dataframe with your (_metadata-prespecified) attribute(s)

>>> mdf = MetaedDataFrame(instrument_name='Binky')
>>> mdf.instrument_name
'Binky'

Or even after instantiation

>>> mdf = MetaedDataFrame()
>>> mdf.instrument_name = 'Binky'
>>> mdf.instrument_name
'Binky'

Without any kind of warning (as of 2021/06/15): serialization and `.copy` work like a charm. Such an approach also lets you enrich your API, e.g. by adding some instrument_name-based members to MetaedDataFrame, such as properties (or methods):

    [...]
    
    @property
    def lower_instrument_name(self) -> str:
        if self.instrument_name is not None:
            return self.instrument_name.lower()

    [...]
>>> mdf.lower_instrument_name
'binky'

... but this is rather beyond the scope of this question ...

keepAlive
2

I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:

    meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
    dfMeta = pd.DataFrame.from_dict(meta, orient='index')

This dfMeta can then be saved alongside your original DF in a pickle file, etc.

See Saving and loading multiple objects in pickle file? (Lutz's answer) for excellent answer on saving and retrieving multiple dataframes using pickle
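Sketching that pattern with the standard pickle module (the file path is just an example):

```python
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}

path = os.path.join(tempfile.gettempdir(), 'df_with_meta.pkl')

# Dump both objects into one file, then load them back in the same order:
with open(path, 'wb') as f:
    pickle.dump(df, f)
    pickle.dump(meta, f)

with open(path, 'rb') as f:
    df2 = pickle.load(f)
    meta2 = pickle.load(f)
```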

SenAnan
  • Yep, you can also save the metadata file in json if it is just a dictionary, rather than casting to a pandas dataframe and then save the dataframe. – SeF Mar 25 '21 at 11:43
1

Adding raw attributes with pandas (e.g. df.my_metadata = "source.csv") is not a good idea.

Even on the latest version (1.2.4 on python 3.8), doing this will randomly cause segfaults when doing very simple operations with things like read_csv. It will be hard to debug, because read_csv will work fine, but later on (seemingly at random) you will find that the dataframe has been freed from memory.

The CPython extensions involved with pandas seem to make very explicit assumptions about the data layout of the dataframe.

attrs is the only safe way to use metadata properties currently: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html

e.g.

df.attrs.update({'my_metadata' : "source.csv"})

How attrs should behave in all scenarios is not fully fleshed out. You can help provide feedback on the expected behaviors of attrs in this issue: https://github.com/pandas-dev/pandas/issues/28283

Jon
0

For those looking to store the DataFrame in an HDFStore, according to pandas.pydata.org, the recommended approach is:

import pandas as pd

df = pd.DataFrame(dict(keys=['a', 'b', 'c'], values=['1', '2', '3']))
df.to_hdf('/tmp/temp_df.h5', key='temp_df')
store = pd.HDFStore('/tmp/temp_df.h5') 
store.get_storer('temp_df').attrs.attr_key = 'attr_value'
store.close()
Olshansky