What's the best way to serialize a Dataframe inline?

Question

I'm trying to create a piece of executable code that includes the following DataFrame embedded in it:

Contract  Date      
201507    2014-06-18    1462.6
          2014-07-03    1518.6
          2014-09-05      10.2
201510    2015-09-14     977.9
201607    2016-05-17    1062.0

I'd like to be able to serialize my existing dataframe and just paste it into the code so that I can share an immediately executable example on another question on StackOverflow, without having to export to CSV etc.

How?

Edit:

The output of to_dict(), which lacks the index:

{(201510, Timestamp('2015-07-21 00:00:00')): 987.90000000000009,
 (201510, Timestamp('2015-08-10 00:00:00')): 973.60000000000014,
 (201604, Timestamp('2016-01-08 00:00:00')): 890.5,
 (201604, Timestamp('2016-01-19 00:00:00')): 837.20000000000005,
 (201607, Timestamp('2016-03-29 00:00:00')): 955.80000000000007}

Can I pickle it to a string and then do something like df=unpickle('xxxx') ? — cjm2671, Jul 10 '16 at 19:57
i think @ayhan first proposed you to use `to_dict()` and then `pd.DataFrame(df_dict)` - wasn't it that what you were looking for? PS ayhan, why did you remove it from your comment? — MaxU - stand with Ukraine, Jul 10 '16 at 20:04
@ayhan, the juanpa.arrivillaga's solution shows that it works with Timestamps as well... — MaxU - stand with Ukraine, Jul 10 '16 at 20:15

juanpa.arrivillaga · Accepted Answer · 2016-07-11T07:44:10.340

Perhaps the .to_dict method might be able to provide what you need?

In [22]: df
Out[22]: 
                     0         1         2         3
first second                                        
bar   one     0.857213  2.541895  0.632027 -0.723664
      two     0.670757  0.131845  0.443510 -0.215069
baz   one     0.244309  0.355917  1.369525  0.016433
      two     0.306323  1.997372 -0.034486 -0.632124
foo   one     1.899891  0.978404 -1.326377 -0.379395
      two    -0.258645  1.334551 -0.002280 -0.570494
qux   one     0.956760  1.516873  0.145715  0.548522
      two    -0.935483 -0.613533 -0.259667  1.678930

In [23]: df_dict = df.to_dict()

In [24]: df_dict
Out[24]: 
{0: {('bar', 'one'): 0.8572134743227553,
  ('bar', 'two'): 0.67075702403871984,
  ('baz', 'one'): 0.24430909274954596,
  ('baz', 'two'): 0.30632263405892973,
  ('foo', 'one'): 1.8998914080547422,
  ('foo', 'two'): -0.25864498582941658,
  ('qux', 'one'): 0.95676035178925078,
  ('qux', 'two'): -0.93548268578556593},
 1: {('bar', 'one'): 2.5418951943252983,
  ('bar', 'two'): 0.13184487691403465,
  ('baz', 'one'): 0.35591677598165794,
  ('baz', 'two'): 1.9973715806631951,
  ('foo', 'one'): 0.97840399034039371,
  ('foo', 'two'): 1.334550971309663,
  ('qux', 'one'): 1.5168730423092398,
  ('qux', 'two'): -0.61353256979962567},
 2: {('bar', 'one'): 0.63202740995444018,
  ('bar', 'two'): 0.44350955006551607,
  ('baz', 'one'): 1.3695250782939834,
  ('baz', 'two'): -0.034485597227602881,
  ('foo', 'one'): -1.32637743164928,
  ('foo', 'two'): -0.0022801431751758058,
  ('qux', 'one'): 0.14571459315814703,
  ('qux', 'two'): -0.25966683560443388},
 3: {('bar', 'one'): -0.72366363290625402,
  ('bar', 'two'): -0.21506930103507182,
  ('baz', 'one'): 0.016432503332560005,
  ('baz', 'two'): -0.63212432354247639,
  ('foo', 'one'): -0.37939466798831689,
  ('foo', 'two'): -0.57049399142274893,
  ('qux', 'one'): 0.54852179259808065,
  ('qux', 'two'): 1.6789299753495908}}

In [25]: pd.DataFrame(df_dict)
Out[25]: 
                0         1         2         3
bar one  0.857213  2.541895  0.632027 -0.723664
    two  0.670757  0.131845  0.443510 -0.215069
baz one  0.244309  0.355917  1.369525  0.016433
    two  0.306323  1.997372 -0.034486 -0.632124
foo one  1.899891  0.978404 -1.326377 -0.379395
    two -0.258645  1.334551 -0.002280 -0.570494
qux one  0.956760  1.516873  0.145715  0.548522
    two -0.935483 -0.613533 -0.259667  1.678930

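You can copy and paste the dictionary output straight into the pd.DataFrame constructor. Note that the index level names (first, second) are not part of the dictionary, so they are lost in the round trip, as Out[25] shows. This also works with datetime values, provided you first run from pandas import Timestamp so that the Timestamp(...) entries in the pasted text resolve: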

In [37]: from pandas import Timestamp

In [38]: df2.to_dict()
Out[38]: 
{0: {0: Timestamp('2011-01-01 05:00:00'),
  1: Timestamp('2011-01-01 06:00:00'),
  2: Timestamp('2011-01-01 07:00:00'),
  3: Timestamp('2011-01-01 08:00:00'),
  4: Timestamp('2011-01-01 09:00:00')}}

In [39]: {0: {0: Timestamp('2011-01-01 05:00:00'),
   ....:   1: Timestamp('2011-01-01 06:00:00'),
   ....:   2: Timestamp('2011-01-01 07:00:00'),
   ....:   3: Timestamp('2011-01-01 08:00:00'),
   ....:   4: Timestamp('2011-01-01 09:00:00')}}
Out[39]: 
{0: {0: Timestamp('2011-01-01 05:00:00'),
  1: Timestamp('2011-01-01 06:00:00'),
  2: Timestamp('2011-01-01 07:00:00'),
  3: Timestamp('2011-01-01 08:00:00'),
  4: Timestamp('2011-01-01 09:00:00')}}

In [40]: pd.DataFrame({0: {0: Timestamp('2011-01-01 05:00:00'),
   ....:   1: Timestamp('2011-01-01 06:00:00'),
   ....:   2: Timestamp('2011-01-01 07:00:00'),
   ....:   3: Timestamp('2011-01-01 08:00:00'),
   ....:   4: Timestamp('2011-01-01 09:00:00')}})
Out[40]: 
                    0
0 2011-01-01 05:00:00
1 2011-01-01 06:00:00
2 2011-01-01 07:00:00
3 2011-01-01 08:00:00
4 2011-01-01 09:00:00
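
As a minimal sketch of the whole round trip in one pasteable script (the frame, its column names value and when, and the variable names are made up for illustration, not taken from the question):

```python
import pandas as pd
from pandas import Timestamp  # needed so the pasted Timestamp(...) calls resolve

# Hypothetical frame: a two-level index plus a float and a datetime column.
df = pd.DataFrame(
    {
        "value": [0.5, 1.5],
        "when": [Timestamp("2011-01-01 05:00:00"), Timestamp("2011-01-01 06:00:00")],
    },
    index=pd.MultiIndex.from_tuples(
        [("bar", "one"), ("bar", "two")], names=["first", "second"]
    ),
)

# Same content as df.to_dict(), typed back in as a literal the way you
# would paste it into a question.
pasted = {
    "value": {("bar", "one"): 0.5, ("bar", "two"): 1.5},
    "when": {
        ("bar", "one"): Timestamp("2011-01-01 05:00:00"),
        ("bar", "two"): Timestamp("2011-01-01 06:00:00"),
    },
}

rebuilt = pd.DataFrame(pasted)
# The dict carries no index level names, so they are restored by hand.
rebuilt.index.names = ["first", "second"]
print(rebuilt)
```

Setting index.names at the end is only needed if the level names matter for the example you are sharing.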

EDIT

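I'm pretty sure the issue you are having is that you've been using a Series rather than a DataFrame, likely as the result of a column slice, e.g. df["colname"]. Series.to_dict() returns a flat dict of scalars keyed by index value, which the DataFrame constructor rejects but the Series constructor accepts. See how I deserialized your dict: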

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas import Timestamp

In [4]: d = {(201510, Timestamp('2015-07-21 00:00:00')): 987.90000000000009,
   ...:  (201510, Timestamp('2015-08-10 00:00:00')): 973.60000000000014,
   ...:  (201604, Timestamp('2016-01-08 00:00:00')): 890.5,
   ...:  (201604, Timestamp('2016-01-19 00:00:00')): 837.20000000000005,
   ...:  (201607, Timestamp('2016-03-29 00:00:00')): 955.80000000000007}

In [5]: pd.DataFrame(d)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-62c9f5619d37> in <module>()
----> 1 pd.DataFrame(d)

/home/juan/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    221                                  dtype=dtype, copy=copy)
    222         elif isinstance(data, dict):
--> 223             mgr = self._init_dict(data, index, columns, dtype=dtype)
    224         elif isinstance(data, ma.MaskedArray):
    225             import numpy.ma.mrecords as mrecords

/home/juan/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    357             arrays = [data[k] for k in keys]
    358 
--> 359         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    360 
    361     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

/home/juan/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5238     # figure out the index, if necessary
   5239     if index is None:
-> 5240         index = extract_index(arrays)
   5241     else:
   5242         index = _ensure_index(index)

/home/juan/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py in extract_index(data)
   5277 
   5278         if not indexes and not raw_lengths:
-> 5279             raise ValueError('If using all scalar values, you must pass'
   5280                              ' an index')
   5281 

ValueError: If using all scalar values, you must pass an index

In [6]: S = pd.Series(d)

In [7]: S
Out[7]: 
201510  2015-07-21    987.9
        2015-08-10    973.6
201604  2016-01-08    890.5
        2016-01-19    837.2
201607  2016-03-29    955.8
dtype: float64

In [8]: df = pd.DataFrame(S)

In [9]: df
Out[9]: 
                       0
201510 2015-07-21  987.9
       2015-08-10  973.6
201604 2016-01-08  890.5
       2016-01-19  837.2
201607 2016-03-29  955.8
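
The same steps can be collected into one script that also restores the labels. This is only a sketch: the level names Contract and Date come from the printout in the question, while the column name price is an assumption, since the question never names the value column.

```python
import pandas as pd
from pandas import Timestamp

# Flat dict as produced by Series.to_dict(): (Contract, Date) tuples -> value.
d = {
    (201510, Timestamp("2015-07-21")): 987.9,
    (201510, Timestamp("2015-08-10")): 973.6,
    (201604, Timestamp("2016-01-08")): 890.5,
    (201604, Timestamp("2016-01-19")): 837.2,
    (201607, Timestamp("2016-03-29")): 955.8,
}

s = pd.Series(d)                       # tuple keys become a two-level MultiIndex
s.index.names = ["Contract", "Date"]   # names are not stored in the dict
df = s.to_frame("price")               # "price" is a hypothetical column name
print(df)
```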

Using to_dict() seems to drop the index. It looks like to_dict() takes a parameter, but it refuses the parameter when I include it. I've pasted to_dict() output above in my question. — cjm2671, Jul 11 '16 at 06:15
@cjm2671 Hmmm, try using `pd.Series` constructor. Else, try solutions here: http://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe — juanpa.arrivillaga, Jul 11 '16 at 06:38
@cjm2671 I suspect that you may be slicing your dataframe using the `df[colname]` construct, which returns a `pd.Series` and using the `pd.Series.to_dict` method. — juanpa.arrivillaga, Jul 11 '16 at 06:43
Sorry, I mean that pd.DataFrame complains about a lack of index when I use from_dict() — cjm2671, Jul 11 '16 at 07:38
@cjm2671 did you try to use the `pd.Series` constructor? Again, I'm pretty sure you're actually using a `Series` rather than a `DataFrame` — juanpa.arrivillaga, Jul 11 '16 at 07:41
@cjm2671 juanpa is right. If you call to_dict on a series, you need to execute `pd.Series(result)`. — ayhan, Jul 11 '16 at 07:41
