
I need to share well-described data and want to do this in a modern way that avoids bureaucratic documentation no one will read. Fields require a brief description or note (e.g. "values don't include ABC because XYZ") that I'd like to associate with columns and have saved by pd.to_<whatever>(), but I don't know of such functionality in pandas.

The format can't present security concerns, and should strike a practical compromise between data integrity, performance, and file size. JSON without an index looks like it might suit.

The JSON Schema documentation describes annotations, which support pairing keywords like description with strings, but I can't figure out how to use this with any of the options in the pandas to_json documentation.

Example df:

df = pd.DataFrame({"numbers": [6, 2], "strings": ["foo", "whatever"]})
df.to_json('temptest.json', orient='table', indent=4, index=False)

We can edit the JSON to include a description:

    "schema":{
        "fields":[
            {
                "name":"numbers",
                "description": "example string",
                "type":"integer"
            },
     ...

We can then read it back with df = pd.read_json("temptest.json", orient='table'), but the descriptions seem to be ignored and are lost upon saving again.
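A minimal round trip on the example above illustrates the loss (a sketch of what I'm doing; the description key is added by hand in between):

```python
import json
import pandas as pd

df = pd.DataFrame({"numbers": [6, 2], "strings": ["foo", "whatever"]})
df.to_json("temptest.json", orient="table", indent=4, index=False)

# Hand-add a description to the first schema field
with open("temptest.json") as f:
    d = json.load(f)
d["schema"]["fields"][0]["description"] = "example string"
with open("temptest.json", "w") as f:
    json.dump(d, f, indent=4)

# Reading back works, but saving again rebuilds the schema from the
# DataFrame alone, so the hand-added key disappears
df = pd.read_json("temptest.json", orient="table")
df.to_json("temptest.json", orient="table", indent=4, index=False)
```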

The only other answer I found saves separate dicts and dfs into a single JSON file, but I couldn't replicate it without "ValueError: Trailing data". I need something less cumbersome and error-prone, and files requiring custom instructions on how to open them aren't appropriate.

How can we work with and save brief data descriptions in JSON or another format?

  • What would be the problem with just adding a `description` column and using `to_json()`? – RJ Adriaansen Jul 16 '21 at 06:51
  • @RJAdriaansen This would double the number of columns which would create more confusion than descriptions would resolve. Also, data requiring such descriptions tends to involve more rows so this would be verbose. – for_all_intensive_purposes Jul 16 '21 at 08:31

2 Answers

1

Would the following rough sketch be something you could live with:

Step 1: Create json structure out of df like you did

df.to_json('temp.json', orient='table', indent=4, index=False)

Step 2: Add the column descriptions to the resulting JSON file, as you already did (this could easily be done in a structured/programmatic manner):

{
    "schema":{
        "fields":[
            {
                "name":"numbers",
                "type":"integer",
                "description": "example number"
            },
            {
                "name":"strings",
                "type":"string",
                "description": "example string"
            }
        ],
        "pandas_version":"0.20.0"
    },
    "data":[...]
}

One way to do that is to write a little function that uses the pandas .to_json output as a base and then adds the desired descriptions to it (this is steps 1 & 2 together):

import json

def to_json_annotated(filepath, df, annotations):
    # Serialize to the Table Schema layout, then attach a description
    # to each field (None for columns without an annotation)
    df_json = json.loads(df.to_json(orient='table', index=False))
    for field in df_json['schema']['fields']:
        field['description'] = annotations.get(field['name'], None)
    with open(filepath, 'w') as file:
        json.dump(df_json, file)

As in the example above:

annotations = {'numbers': 'example number',
               'strings': 'example string'}
to_json_annotated('temp.json', df, annotations)

Step 3: Reading the information back into Pandas-format:

import json
import pandas as pd

with open('temp.json', 'r') as file:
    json_df = json.load(file)
df_data = pd.json_normalize(json_df, 'data')  # pd.read_json('temp.json', orient='table') works too
df_meta = pd.json_normalize(json_df, ['schema', 'fields'])
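For symmetry with to_json_annotated above, the reading side can be wrapped the same way (a sketch; from_json_annotated is just a name chosen here, not a pandas function):

```python
import json
import pandas as pd

def from_json_annotated(filepath):
    """Load annotated table-JSON: returns (DataFrame, {column: description})."""
    with open(filepath) as file:
        json_df = json.load(file)
    data = pd.json_normalize(json_df, 'data')
    annotations = {field['name']: field.get('description')
                   for field in json_df['schema']['fields']}
    return data, annotations
```

Then `df_data, annotations = from_json_annotated('temp.json')` gives the data and the descriptions side by side.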

with the results:

df_data:

   numbers   strings
0        6       foo
1        2  whatever

df_meta:

      name     type     description
0  numbers  integer  example number
1  strings   string  example string
Timus
  • This looks as good as anyone could hope for. Files requiring custom instructions on how to open them aren't appropriate, but I suppose we could `pd.read_json("temptest.json", orient='table')`, except pandas would ignore descriptions. – for_all_intensive_purposes Jul 17 '21 at 03:44
  • I'm trying to program your step #2. We could save df then `with open('temptest.json') as f: d = json.loads(f.read())['schema']['fields']`, but how could we insert a `description` key + string value into the dict where `name` matches the col in a list of unnamed dicts? – for_all_intensive_purposes Jul 17 '21 at 03:50
  • @ChrisDixon I've added an example of how this could be done. (General remark: I'm not saying that this solution is _pretty_ :) Maybe someday the Pandas devs will add such a feature to the DataFrames etc.) – Timus Jul 17 '21 at 12:51
  • We may soon be in luck: https://github.com/pandas-dev/pandas/issues/42582#issuecomment-881925029 – for_all_intensive_purposes Jul 18 '21 at 22:20
0

I didn't see that there is such an option, but I think you can just add a description inside each variable:

schema = {'first': {'Variable': 'x', 'description': 'example string', 'value': 2},
          'second': {'Variable': 'y', 'description': 'example string', 'value': 3}}

It creates this table:

                     first          second
Variable                  x               y
description  example string  example string
value                     2               3
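Presumably the table above comes from passing the nested dict straight to the DataFrame constructor (each outer key becomes a column, the inner keys the index), as the comment below suggests:

```python
import pandas as pd

schema = {'first': {'Variable': 'x', 'description': 'example string', 'value': 2},
          'second': {'Variable': 'y', 'description': 'example string', 'value': 3}}
table = pd.DataFrame(schema)
print(table)
```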
Ulewsky
  • Thank you, but supplying existing dfs as dicts and then adding a new row of strs into all cols doesn't seem practical. Also, I don't think `pd.DataFrame(schema)` yields an appropriate df. – for_all_intensive_purposes Jul 16 '21 at 08:47