I have produced some software that is processing data for analysis and plotting. For each type of data the data frames are produced in a module dedicated for the type. Depending on the structure of the data the data frame columns could be normal or multindex. I will pass the data frames to a procedure function that will produce plots of columns that are numeric.
I would like to be able to "attach" a string to each of the "printable" column with a string that will be used as plot labels. This string will not be the same as the name of the column.
I don't seem to be able to figure out a good way to do this purely with pandas DataFrame, so far I don't have any other solution either.
I have seen posts about metadata but I don't completely understand if this functionality is supported or not? At least I don't get this to work, especially it seems like using frames with MultiIndex columns complicates things. If it is not supported is it still on the todo list? From my reading I get the impression it have worked differently in different versions of pandas and even depend on if python 2 or 3 is used. I would like to know if there is a convenient way to accomplish what I require with Pandas data frames? Is using _metadata for this advisable? If so how?
I have looked around quite a bit but especially the MultiIndex concern seems to not be addressed anywhere.
This one seem to indicate that metadata should be supported but is it for data frames? I need Series in a DataFrame. Adding meta-information/metadata to pandas DataFrame
This one seem to be a similar question but I have tried the solution and it did not help, I tried the solution but it seems not to help me. Propagate pandas series metadata through joins
Here is some experimentation I have done based on my understanding of the use of _metadata functionality. It seems to indicate that the _metadata did not make any difference and that the attribute did not persist a copy. Also it shows that using MultiIndex is an even more "unsupported" case.
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from numpy.random import randn # To get values for the test frames
>>> import platform # To print python version
>>> # A function to set labels of the columns
>>> def labelSetter(aDF) :
... DFtmp = aDF.copy() # Just to ensure it is a different dataframe
... for column in DFtmp.columns :
... DFtmp[column].myLab='This is '+column.__str__()
... DFtmp[column].notMyLab='This should not persist'
... return DFtmp
...
>>>
>>> print 'Pandas version: {}'.format(pd.version.version)
Pandas version: 0.15.2
>>>
>>> pd.Series._metadata.append('myLab');print pd.Series._metadata # now _metadata contains 'myLab'
['name', 'myLab']
>>>
>>> # Make dataframes normal columns and MultiIndex
>>> dfS=pd.DataFrame(randn(2, 6),columns=['a1','a2','a3','b1','b2','c1']);print dfS
a1 a2 a3 b1 b2 c1
0 -0.934869 -0.310979 0.362635 -0.994605 -0.880114 -1.663265
1 0.205341 -1.642080 -0.732969 -0.080109 -0.082483 -0.208360
>>>
>>> dfMI=pd.DataFrame(randn(2, 6),columns=[['a','a','a','b','b','c'],['a1','a2','a3','b1','b2','c1']]);print dfMI
a b c
a1 a2 a3 b1 b2 c1
0 -0.578399 0.478925 1.047342 -0.087225 1.905074 0.146105
1 0.640575 0.153328 -1.117847 1.043026 0.671220 -0.218550
>>>
>>> # Run the labelSetter function on the data frames
>>> dfSWlab=labelSetter(dfS)
>>> dfMIWlab=labelSetter(dfMI)
>>>
>>> print dfSWlab['a2'].myLab
This is a2
>>> # This worked
>>>
>>> print dfSWlab['a2'].notMyLab
This should not persist
>>> # 'notMyLab' has not been appended to _metadata but the label still persists.
>>>
>>> dfSWlabCopy=dfSWlab.copy() # make a copy to see if myLab persists.
>>>
>>> dfSWlabCopy['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # 'myLab' was appended to _metadata but still did not persist the copy
>>>
>>> print dfMIWlab['a']['a2'].myLab
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1942, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'myLab'
>>> # For the MultiIndex data frame the 'myLab' is not accessible