I need to share well-described data and want to do this in a modern way that avoids managing bureaucratic documentation no one will read. Fields require some description or note (e.g. "values don't include ABC because XYZ"), which I'd like to associate with columns so that it is saved with pd.to_<whatever>(), but I don't know of such functionality in pandas.
The format can't present security concerns and should strike a practical compromise between data integrity, performance, and file size. JSON without the index looks like it might suit.
The JSON documentation describes schema annotations, which support pairing keywords like description with strings, but I can't figure out how to use this with the options described in the pandas to_json documentation.
Example df:
import pandas as pd
df = pd.DataFrame({"numbers": [6, 2], "strings": ["foo", "whatever"]})
df.to_json('temptest.json', orient='table', indent=4, index=False)
We can edit the JSON to include description:
"schema":{
"fields":[
{
"name":"numbers",
"description": "example string",
"type":"integer"
},
...
We can then load it with df = pd.read_json("temptest.json", orient='table'), but the descriptions seem to be ignored and are lost upon saving.
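To make the problem reproducible, here is a minimal sketch of the round trip described above. It assumes the edit is made with the standard json module rather than by hand; the temporary path and the names payload, df2 and roundtrip are only for illustration.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"numbers": [6, 2], "strings": ["foo", "whatever"]})

# Write to a temporary directory so the sketch leaves no files in the working dir.
path = os.path.join(tempfile.mkdtemp(), "temptest.json")
df.to_json(path, orient="table", indent=4, index=False)

# Add a description to the "numbers" field inside the schema block.
with open(path) as f:
    payload = json.load(f)
for field in payload["schema"]["fields"]:
    if field["name"] == "numbers":
        field["description"] = "example string"
with open(path, "w") as f:
    json.dump(payload, f, indent=4)

# The edited file still loads, but the description is not kept on the DataFrame.
df2 = pd.read_json(path, orient="table")

# Saving again rebuilds the schema from the dtypes, so the description is gone.
roundtrip = json.loads(df2.to_json(orient="table", index=False))
descriptions = [f.get("description") for f in roundtrip["schema"]["fields"]]
print(descriptions)
```

In my testing the data itself survives, but neither field in the re-saved schema has a description key.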
The only other answer I found saves separate dicts and dfs into a single JSON, but I couldn't replicate this without "ValueError: Trailing data". I need something less cumbersome and error-prone, and files that require custom instructions on how to open them aren't appropriate.
How can we work with and save brief data descriptions with JSON or another format?