4

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)

Background
A) I can successfully read/write my rather large Pandas Dataframe from/to JSON using DataFrame.to_json() and DataFrame.from_json(). No problems.

B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()


My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to

top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)


Failure modes
C) I have found that even the simplified case of

df_dict = df.to_dict()
json.dump(df_dict, fp)

fails with:

TypeError: key (u'US', 112, 5, 80, 'wl') is not a string

D) Investigating, I've found that the complement also fails.

df.to_json(fp)
json.load(fp)

fails with

384             raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)

So it appears that Pandas JSON format and the Python's JSON library are not compatible.

My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.


Question
What is the cannonical/recommended method for adding metadata to a Pandas DataFrame and storing to a JSON-formatted file?

Python 2.7.10
Pandas 0.17

Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.

Lesson: Read the documentation kids, even when it's your own documentation.

Edit 2:

If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.

cdabel
  • 461
  • 9
  • 21
JS.
  • 14,781
  • 13
  • 63
  • 75

2 Answers2

6

You should be able to just put the data on separate lines.

Writing:

f = open('test.json', 'w')
df.to_json(f)
print >> f
json.dump(metadata, f)

Reading:

f = open('test.json')
df = pd.read_json(next(f))
metdata = json.loads(next(f))
Evan Wright
  • 680
  • 5
  • 9
3

In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.

In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.

If you similarly need to store the data and metadata in a single JSON blob, the following code will work:

top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
    json.dump(top_level_dict, outfile)

Just make sure DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.

Reading the data back in:

with open(FILENAME, 'r') as infile:
    top_level_dict = json.load(infile)

df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame().as_dict(df_as_dict)

meta = top_level_dict['metadata']

At this point, you'll need to re-create your Multi-Index (if applicable)

JS.
  • 14,781
  • 13
  • 63
  • 75
  • 1
    This answer is great! Regarding reading data back in, it looks that _pandas_ API has changed and currently `from_dict` should be used instead of `as_dict`. – apawelek Feb 20 '21 at 09:34