
Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?

As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:

df = make_df([[9.75,   1],
              [6.375,  2],
              [9.,     3],
              [0.25,   1],
              [1.875,  2],
              [3.75,   3],
              [8.625,  1]],
             ['d', 'i'])

For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.

AFAIK,

  1. officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
  2. the pandas.DataFrame.values property does not preserve the original data types.

I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names are not what I would expect, so it is quite possible that I have missed one or both of these functions.)


FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)

import pandas as pd
import collections as co
import pandas.util.testing as pdt  # note: pandas.testing in newer pandas versions

def make_df(values, columns):
    return pd.DataFrame(co.OrderedDict([(columns[i],
                                         [row[i] for row in values])
                                        for i in range(len(columns))]))

def unmake_df(dataframe):
    columns = list(dataframe.columns)
    return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
            columns)

values = [[9.75,   1],
          [6.375,  2],
          [9.,     3],
          [0.25,   1],
          [1.875,  2],
          [3.75,   3],
          [8.625,  1]]
columns = ['d', 'i']

df = make_df(values, columns)

Here's the output of the call to make_df above:

>>> df
       d  i
0  9.750  1
1  6.375  2
2  9.000  3
3  0.250  1
4  1.875  2
5  3.750  3
6  8.625  1

A simple check of the round-trip¹:

>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True

BTW, this is an example of the loss of the original values' types:

>>> df.values
array([[ 9.75 ,  1.   ],
       [ 6.375,  2.   ],
       [ 9.   ,  3.   ],
       [ 0.25 ,  1.   ],
       [ 1.875,  2.   ],
       [ 3.75 ,  3.   ],
       [ 8.625,  1.   ]])

Notice how the values in the second column are no longer integers, as they were originally.

Hence,

>>> df == make_df(df.values, columns)
False
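The upcast happens because .values returns a single homogeneous NumPy array, so the integer column gets promoted to float. A minimal sketch illustrating this (the frame here is hypothetical but mirrors the one above):

```python
import pandas as pd

df = pd.DataFrame({'d': [9.75, 6.375], 'i': [1, 2]})
print(df.dtypes)        # per-column dtypes: 'd' is float64, 'i' is int64
print(df.values.dtype)  # float64 -- one homogeneous array, ints upcast
```
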

¹ In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:

def pd_DataFrame___eq__(self, other):
    try:
        pdt.assert_frame_equal(self, other,
                               check_index_type=True,
                               check_column_type=True,
                               check_frame_type=True)
    except Exception:  # assert_frame_equal raises AssertionError on any mismatch
        return False
    else:
        return True

pd.DataFrame.__eq__ = pd_DataFrame___eq__

Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.

kjo
  • Pandas constructs dataframes fine when it is passed a list of lists: http://stackoverflow.com/questions/19112398/getting-list-of-lists-into-pandas-dataframe/19112890#19112890 – EdChum Sep 11 '14 at 15:12
  • you can use ``DataFrame.equals(other)`` for equality testing – Jeff Sep 11 '14 at 15:15
  • @EdChum: Yes, AFAICT, that behavior is not documented, so I don't want to bank on it. – kjo Sep 11 '14 at 17:05

1 Answer

I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).

In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]: 
   0  1
0  a  1
1  b  2

[2 rows x 2 columns]

In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])

In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
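That nested dict can be fed straight back into the constructor. A minimal sketch of the round-trip (column order may differ in general, though here it is preserved):

```python
import pandas as pd

t = pd.DataFrame([['a', 1], ['b', 2]])
d = t.to_dict()       # {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
t2 = pd.DataFrame(d)  # outer keys become columns, inner keys the index
print(t.equals(t2))   # True -- values and dtypes survive the round-trip
```
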

Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).

The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.

Since there can be many differing notions of equality between two DataFrames (e.g. same columns including type, just same named columns, same columns in the same order, or same columns in mixed order, etc.), pandas defaults to being the least specific about it (Python's principle of least astonishment).

So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.

For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
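A minimal sketch of that last step, assuming you accept sorted keys as your explicit column order (the variable names here are illustrative, not from any pandas API):

```python
import pandas as pd

t = pd.DataFrame([['a', 1], ['b', 2]])
d = t.to_dict()
cols = sorted(d)                       # choose an explicit column order
rows = sorted(next(iter(d.values())))  # inner keys are the row index
values = [[d[c][r] for c in cols] for r in rows]
print(values)  # [['a', 1], ['b', 2]]
```
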

In other words, because order might not matter among the columns, the "inverse" of the list-of-list constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure -- and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.

As mentioned elsewhere, you should use DataFrame.equals to do equality checking among DataFrames. For finer control over what counts as equal (index type, column type, exact dtype, etc.), pandas.util.testing.assert_frame_equal exposes many options, while equals itself stays a reasonably generic default.
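A minimal sketch of the difference between equals and element-wise == (toy frames, not from the question):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [1.0, 2.0]})  # same values, but float dtype

print((a == b).all().all())  # True -- element-wise comparison ignores dtype
print(a.equals(b))           # False -- equals requires matching dtypes
```
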

ely
  • Thanks for your comments. The documentation I linked says that the `data` argument may be a "numpy ndarray (structured or homogeneous), dict, or DataFrame", and follows this line with one that says "***Dict can contain*** Series, arrays, constants, or list-like objects" (my emphasis). I interpret this to mean that *when the `data` argument is a `dict`*, its values can be Series, arrays, etc. IOW, the "list-like objects" bit is not referring to the `data` argument itself. I am specifically looking for a `data` argument that is a Python list, not a dict, so this clause does not apply. – kjo Sep 11 '14 at 16:23
  • Thanks also for the comments on column order. If Pandas really considers column order not an important property of dataframes, then I'd say that their having implemented support for accessing columns by number (e.g. `df.iloc[:, 3]`) is a design blunder of colossal proportions. – kjo Sep 11 '14 at 16:27
  • Whether the numerical order of the column is important or not is a matter of the application, and thus the code that a *user* of pandas writes *around* their use of pandas. Just because index-based access is permitted doesn't mean that pandas needs to make any commital choice as to why or how it is permitted or used. *Not* permitting it, and thus needlessly restricting what options are available to programmers *who happen to want ordered columns* would be the bigger mistake. – ely Sep 11 '14 at 18:07
  • To clarify on the `dict` point you raised: yes this part could be documented more clearly. What is happening is that for a data input that is conformable to non-error-creating dimensions of data, if the column names and index are not given, then they just default to `range(num_columns)` and `range(num_rows)` respectively. So in effect, by passing a list of lists, you are saying: "I want to pass a `dict` of lists, but the name of the columns and rows is arbitrary so don't bother and just use integers starting from 0." – ely Sep 11 '14 at 18:11
  • It gets back to my other point: wanting an unordered set of columns to respect the ordering from an ordered data container is too specific to be handled as a default. Your approach of writing functions like `make_df` and `unmake_df` (that perhaps use decorators to add ordered structure to the DataFrame and memoize that ordered structure) is the "right" way to do it. Expecting a generic data structure (which, since it needs to be sufficiently generic, needs to support use as an unordered array and *also* support arbitrary item look-ups) to do this specific action for you is miswanting. – ely Sep 11 '14 at 18:14
  • For example, this is why `OrderedDict` was created. Why would people expect regular `dict` objects to keep track of the order the data was added? That's too specific for a generic map data structure, but it's fine to have it as a special extra case, and was implemented by subclassing `dict` and adding the extra behavior. The "real" solution to your question would be to make an analogous `OrderedDataFrame`. Then, by assumption, since the order matters, it can faithfully map backwards to an ordered list of lists used to construct it. And this would be done with code like `make_df` under the hood – ely Sep 11 '14 at 18:17