
I have finally figured out how to use _metadata from a DataFrame. Everything works, except that I am unable to persist it, for example to HDF5 or JSON. I know it works because when I copy the frame, the _metadata attributes copy over, while "non-_metadata" attributes don't.

Example:

import pandas

df = pandas.DataFrame()  # make up a frame to your liking
pandas.DataFrame._metadata = ["testmeta"]
df.testmeta = "testmetaval"
df.badmeta = "badmetaval"
newframe = df.copy()
newframe.testmeta  # outputs "testmetaval"
newframe.badmeta  # raises AttributeError

# JSON test
df.to_json(Path)
revivedjsonframe = pandas.io.json.read_json(Path)
revivedjsonframe.testmeta  # raises AttributeError

# HDF5 test
revivedhdf5frame.testmeta  # returns None

This person https://stackoverflow.com/a/25715719/4473236 says it worked for him, but I'm new to this site (and to pandas) and can't post to that thread or ask him directly.

Skorpeo
  • My understanding is that `copy()` does not copy metadata; maybe this has changed. Strangely, though, I don't see your error: I get `'badmetaval'` as output. What versions of Python, NumPy and pandas are you running? I'm running Python 3.3.5 64-bit, pandas 0.15.2 and NumPy 1.9.1. – EdChum Jan 20 '15 at 09:44
  • I'm using Python 2.7.5; the rest is the same. I don't know why it would work. My understanding is that only _metadata attributes "propagate". Did you try persisting it to JSON? Did the JSON file contain the attributes? – Skorpeo Jan 20 '15 at 09:54
  • Yep, it all still works for me, so I'm not sure what your problem could be. Are you able to try a Python 3 version? – EdChum Jan 20 '15 at 10:10
  • Not right now... Strange. What really befuddles me is why badmetaval would copy over. That would mean that this issue is moot: https://github.com/pydata/pandas/issues/2485 – Skorpeo Jan 20 '15 at 10:30
  • @EdChum: I'm seeing the same behavior as Skorp, using pandas 0.15.2. Would you please post the code you are using? – unutbu Jan 20 '15 at 13:56

2 Answers


_metadata is prefaced with an underscore, which means it's not part of the public API. It's not intended for user code -- we might break it in any future version of pandas without warning.

I would strongly recommend against using this "feature". For now, the best option for persisting metadata with a DataFrame is probably to write your own wrapper class and handle the persistence yourself.
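As a rough illustration of the wrapper-class idea, here is a minimal sketch. It assumes the metadata is a plain, JSON-serialisable dict and that JSON is the storage format. The class name `FrameWithMeta` and its methods are hypothetical and not part of pandas; only `DataFrame.to_json(orient="split")` and the `DataFrame` constructor are real pandas calls.

```python
import json

import pandas as pd


class FrameWithMeta:
    """Hypothetical wrapper: a DataFrame plus a plain dict of metadata."""

    def __init__(self, frame, metadata=None):
        self.frame = frame
        self.metadata = dict(metadata or {})

    def to_json(self, path):
        # Let pandas serialise the frame, then store it next to the metadata
        # in a single JSON document.
        payload = {
            "metadata": self.metadata,
            "frame": json.loads(self.frame.to_json(orient="split")),
        }
        with open(path, "w") as fh:
            json.dump(payload, fh)

    @classmethod
    def read_json(cls, path):
        with open(path) as fh:
            payload = json.load(fh)
        parts = payload["frame"]
        # Rebuild the frame from the "split" layout: data, index, columns.
        frame = pd.DataFrame(
            parts["data"], index=parts["index"], columns=parts["columns"]
        )
        return cls(frame, payload["metadata"])
```

Usage would look like `FrameWithMeta(df, {"testmeta": "testmetaval"}).to_json(path)` followed by `FrameWithMeta.read_json(path).metadata`. Because the metadata lives in the file itself, it does not depend on any private pandas attribute. Dtypes that JSON cannot represent directly (for example datetimes) would need extra handling that this sketch leaves out.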

shoyer
  • You are right with respect to the _metadata piece. In this case, though, the behaviour of df.badmeta that EdChum reports above is very strange, and it does not use _metadata. I am not sure where the values are coming from or where they are stored. Being a novice, I would be worried that the class is somehow being appended to, since how else could he retrieve the value of badmeta? It should not exist, since he is loading an empty JSON file, yet the DataFrame has an attribute that is not in the file. An artifact of some sort... Finally, I would love to help get this metadata functionality implemented. – Skorpeo Jan 21 '15 at 00:20
  • Will you break it and delete it, or break it and replace it with a similar feature? :) I just need to store a boolean flag together with a DataFrame on disk, and I don't want to make (un)pickling more complex. – Winand Nov 11 '15 at 13:21

This is my code, which works using Python 3.3.3.2 64-bit:

In [69]:

import numpy as np
import pandas as pd

df = pd.DataFrame()  # make up a frame to your liking
pd.DataFrame._metadata = ["testmeta"]
print(pd.DataFrame._metadata)
df.testmeta = "testmetaval"
df.badmeta = "badmetaval"
newframe = df.copy()
print(newframe.testmeta)
print("newframe", newframe.badmeta)
df.to_json(r'c:\data\test.json')
read_json = pd.read_json(r'c:\data\test.json')
read_json.testmeta
print(pd.version.version)
print(np.version.full_version)
Out[69]:

['testmeta']
testmetaval
newframe badmetaval
0.15.2
1.9.1

JSON contents as df:

In [70]:

read_json
Out[70]:
Empty DataFrame
Columns: []
Index: []
In [71]:

read_json.info()
<class 'pandas.core.frame.DataFrame'>
Float64Index: 0 entries
Empty DataFrame

In [72]:

read_json.testmeta
Out[72]:
'testmetaval'

Strangely, the JSON that is written is just an empty pair of braces:

{}

This would indicate that the metadata is not being read from the JSON file but is actually being propagated by the statement pd.DataFrame._metadata = ["testmeta"].

It still seems to work if you overwrite the attribute on a second DataFrame:

In [75]:

df.testmeta = 'foo'
df2 = pd.DataFrame()
df2.testmeta = 'bar'
read_json = pd.read_json(r'c:\data\test.json')
print(read_json.testmeta)
print(df2.testmeta)
testmetaval
bar
EdChum
  • Thanks for posting this. What value is printed by `read_json.testmeta`? What are the contents of `c:\data\test.json`? – unutbu Jan 20 '15 at 15:15
  • @unutbu Sorry, IPython output gets a little screwy when you have print statements and the last statement is a variable. I've added the output separator, which should answer your question. I'll post the contents of the JSON file as well. – EdChum Jan 20 '15 at 15:17
  • Wait, why are we modifying `_metadata` in the `DataFrame` class itself? – DSM Jan 20 '15 at 15:21
  • @DSM yes that is odd, unclear what the OP intends to gain by making such a mod – EdChum Jan 20 '15 at 15:22
  • I wonder: if you define two DataFrames, one with `df.testmeta = 'foo'` and another with `df2.testmeta = 'bar'`, and then define `read_json = pd.read_json(r'c:\data\test.json')`, what will `read_json.testmeta` return? How would it know to look in `df.testmeta` and not `df2.testmeta`? – unutbu Jan 20 '15 at 15:33
  • This has to be a bug. If nothing is in the JSON file, then something is not right. From my unlearned understanding, _metadata just tells pandas which attributes are metadata; it is then up to each frame to create/assign those attributes. Here are a few more links that I used in my journey: http://stackoverflow.com/questions/23200524/propagate-pandas-series-metadata-through-joins, https://github.com/pydata/pandas/issues/6923. I guess I should figure out how to report this... – Skorpeo Jan 20 '15 at 19:50
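DSM's question about modifying `_metadata` on the `DataFrame` class itself can be illustrated without pandas. The sketch below uses a hypothetical stand-in class `Frame` (not pandas code) to show the general Python behaviour: an attribute assigned on a class is visible from every instance, while an attribute assigned on one instance stays on that instance. It does not attempt to explain the differing results reported in this thread.

```python
class Frame:
    """Hypothetical stand-in for a library class such as DataFrame."""

    _metadata = []  # class-level attribute, looked up by every instance


a = Frame()
b = Frame()

# Rebinding the attribute on the class changes what every instance sees,
# including instances created before the assignment.
Frame._metadata = ["testmeta"]
seen_by_a = a._metadata
seen_by_b = b._metadata

# Setting a value on one instance, by contrast, stays on that instance.
a.testmeta = "foo"
b_has_testmeta = hasattr(b, "testmeta")
```

So a statement like `pd.DataFrame._metadata = ["testmeta"]` is a process-wide change that applies to every DataFrame in the session, not a setting on one frame.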