3

Is there a way to attach information on the data source to a pandas series? At the moment I just add columns to the dataframe that indicate the source for each variable...

Thanks a lot for ideas and suggestions!

jpp
hendriksc1

2 Answers

3

From the official pandas documentation:

To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:

  1. Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.

  2. Define _metadata for normal properties which will be passed to manipulation results.

Below is an example that defines two original properties, “internal_cache” as a temporary property and “added_property” as a normal property:

import pandas as pd

class SubclassedDataFrame2(pd.DataFrame):

    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)

    # normal properties
    _metadata = ['added_property']

    # make manipulation results come back as this subclass
    @property
    def _constructor(self):
        return SubclassedDataFrame2


>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'

>>> df.internal_cache
cached
>>> df.added_property
property

# properties defined in _internal_names are reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'

# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
property

As you can see, the benefit of defining custom properties through _metadata is that they are propagated automatically during (most) one-to-one dataframe operations. Be aware, though, that during many-to-one dataframe operations (e.g. merge() or concat()) your custom properties will still be lost.
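Since the question is about a Series rather than a DataFrame, the same `_metadata` mechanism can be sketched on a `Series` subclass. This is only an illustration: the class name `SourcedSeries`, the attribute name `source`, and the value `'survey_2019'` are made up for the example, and propagation is still limited to operations that pass results through the subclass constructor.

```python
import pandas as pd


class SourcedSeries(pd.Series):
    # Hypothetical subclass: extend (rather than replace) the built-in
    # metadata list so the series name keeps being propagated as well.
    _metadata = pd.Series._metadata + ['source']

    @property
    def _constructor(self):
        # Results of operations on this series are built as SourcedSeries.
        return SourcedSeries


s = SourcedSeries([1, 2, 3], index=list('abc'))
s.source = 'survey_2019'  # illustrative value, not from the question

# One-to-one operations such as copy() and head() carry the attribute along.
copied = s.copy()
first_two = s.head(2)
print(copied.source, first_two.source)
```

Note that a plain `pd.Series` pulled out of an ordinary DataFrame is not a `SourcedSeries`, so the attribute has to be set on a series you construct (or convert) yourself.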

Xukrao
  • IIUC this adds properties to a `DataFrame`, not to its `Series` objects... – Michel de Ruiter Dec 19 '19 at 14:46
  • @MicheldeRuiter A `Series` object that is created from a `DataFrame` (e.g. by slicing) will indeed not inherit the dataframe's metadata properties. See also [pandas github issue #19850](https://github.com/pandas-dev/pandas/issues/19850). – Xukrao Dec 19 '19 at 22:43
  • Sure, but then it doesn't answer the (and my) question. – Michel de Ruiter Dec 24 '19 at 15:56
2

As with most Python objects, you can add an attribute to a Series using dot (.) syntax. However, you should be careful that your attribute names do not conflict with index labels. Here's a demonstration:

import pandas as pd

s = pd.Series(list(range(3)), index=list('abc'))
s.a = 10
s.d = 20

print(s.a, s.d)

10 20

print(s)

a    10
b     1
c     2
dtype: int64

As shown above, you may unwittingly overwrite the value for a label when in fact you wanted to add an attribute. One way to alleviate this problem, as described here, is to perform a simple check:

if 'a' not in s:
    s.a = 100
else:
    print('Attempt to overwrite label when setting attribute aborted!')
    # or raise a custom error
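If you would rather raise an error than print a message, the check can be wrapped in a small helper. The sketch below is one possible way to do it; the function name `set_attribute_safely` and the attribute name `source` are hypothetical, and the extra guard against shadowing existing `Series` methods is an addition that goes beyond the check above.

```python
import pandas as pd


def set_attribute_safely(series, name, value):
    """Attach `value` to `series` as attribute `name` without clobbering data.

    Hypothetical helper: refuses names that are index labels (assignment
    would overwrite the stored value) or existing Series attributes/methods.
    """
    if name in series.index:
        raise AttributeError(
            f"{name!r} is an index label; setting it would overwrite data"
        )
    if hasattr(type(series), name):
        raise AttributeError(
            f"{name!r} is already an attribute of {type(series).__name__}"
        )
    setattr(series, name, value)


s = pd.Series(list(range(3)), index=list('abc'))
set_attribute_safely(s, 'source', 'survey_2019')  # fine: no label 'source'

try:
    set_attribute_safely(s, 'a', 100)  # 'a' is a label, so this is refused
except AttributeError as exc:
    print(exc)
```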

Note that operations on a dataframe such as GroupBy, pivot, etc., as described here, may return copies of the data with your attributes removed.
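The same applies to a Series: an attribute set with dot syntax lives only on that one object. A short sketch (the attribute name `source` and its value are illustrative only) shows it disappearing from derived objects:

```python
import pandas as pd

s = pd.Series(list(range(3)), index=list('abc'))
s.source = 'survey_2019'  # plain attribute, unknown to pandas

doubled = s * 2    # arithmetic returns a new Series
copied = s.copy()  # so does an explicit copy

print(hasattr(s, 'source'))        # True
print(hasattr(doubled, 'source'))  # False
print(hasattr(copied, 'source'))   # False
```

If the information must survive such operations, you have to re-attach it yourself or keep it somewhere pandas knows about.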

Finally, for storing dataframes or series with metadata attached, you may wish to consider HDF5. See, for example, this answer.

jpp