
As a follow-up to my question on mixed types in a column:

Can I think of a DataFrame as a list of columns or is it a list of rows?

In the former case, it means that (optimally) each column has to be homogeneous (type-wise) and different columns can be of different types. The latter case suggests that each row is type-wise homogeneous.

From the documentation:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

This implies that a DataFrame is a list of columns.

Does it mean that appending a row to a DataFrame is more expensive than appending a column?

Dror
  • It is probably worth reading this: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe – EdChum Dec 09 '14 at 09:24
  • The citation I provided is from that link :) – Dror Dec 09 '14 at 09:26
  • In that case your thinking about the internal data structures is correct and Joris's answer explains this. Appending a row will be expensive because if the existing memory allocation is insufficient then a new allocation must be made and the contents copied such that it will be a contiguous block of memory for performance reasons – EdChum Dec 09 '14 at 09:29

2 Answers


You are fully correct that a DataFrame can be seen as a list of columns, or even better, as an (ordered) dictionary of columns (see explanation here).

Indeed, each column has to be homogeneous in type, and different columns can be of different types. But by using the object dtype you can still hold different types of objects in one column (although this is not recommended, apart from e.g. strings).
To illustrate, if you ask for the data types of a DataFrame, you get the dtype for each column:

In [2]: df = pd.DataFrame({'int_col':[0,1,2], 'float_col':[0.0,1.1,2.5], 'bool_col':[True, False, True]})

In [3]: df.dtypes
Out[3]:
bool_col        bool
float_col    float64
int_col        int64
dtype: object

Internally, the values are stored as blocks of the same type. Each column, or collection of columns of the same type, is stored in a separate array.
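As a rough illustration of this block storage, you can peek at the block manager via the private `_mgr` attribute. Note this is an internal pandas API that can change between versions (it was called `_data` in older releases), so this is only a sketch for the default block-backed DataFrame:

```python
import pandas as pd

# two int64 columns and one float64 column
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [1.0, 2.0]})

# the internal BlockManager groups columns by dtype: here, one block
# holding both int columns and one block holding the float column
print(df._mgr.nblocks)
```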

And this indeed implies that appending a row is more expensive. In general, appending many single rows one by one is not a good idea: it is better to e.g. preallocate an empty DataFrame to fill, or to put the new rows/columns in a list and concat them all at once.
See the note at the end of the concat/append docs (just before the first subsection "Set logic on the other axes").
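A minimal sketch of that pattern, collecting the pieces first and concatenating once at the end:

```python
import pandas as pd

# collect the pieces in a plain Python list...
rows = [pd.DataFrame({'x': [i], 'y': [i ** 2]}) for i in range(5)]

# ...then concatenate once, instead of appending row by row in a loop,
# which would reallocate and copy the internal blocks on every iteration
df = pd.concat(rows, ignore_index=True)
print(df.shape)  # (5, 2)
```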

joris
  • Could you make this answer really perfect and add some references to relevant documentation of the discussed issues? – Dror Dec 09 '14 at 09:17
  • I added a link for concat/append, but for the internals, I don't think there is any good documentation. – joris Dec 09 '14 at 09:31
  • https://github.com/pandas-dev/pandas2/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist here is the design document from the developers of Pandas. From what they say here, DataFrame is indeed implemented column-wise – Rafael Apr 06 '21 at 21:45

To address the question "Is appending a row to a DataFrame more expensive than appending a column?", we need to take various factors into account, but the most important one is the internal physical data layout of a pandas DataFrame.

The short and somewhat naive answer: if the table (aka DataFrame) is stored in a column-wise physical layout, then adding or fetching a column is faster than doing the same with a row; if the table is stored in a row-wise physical layout, it's the other way around. In general, the default pandas DataFrame is stored column-wise (but NOT all the time). So in general, appending a row to a DataFrame is indeed more expensive than appending a column. And you can think of a pandas DataFrame as a dict of columns.
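To sketch the asymmetry in code (under the default column-wise layout): appending a column is a single assignment that adds one more internal array, while appending a row touches every column and forces the blocks to be rebuilt:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# appending a column: the new values simply become one more
# internal array; existing blocks are left alone
df['c'] = [True, False]

# appending a row: every column grows, so the internal blocks
# have to be copied into new, larger arrays
row = pd.DataFrame({'a': [5], 'b': [6.0], 'c': [True]})
df = pd.concat([df, row], ignore_index=True)
print(df.shape)  # (3, 3)
```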

A longer answer: pandas needs to choose a way to arrange the internal layout of a table in memory (such as a DataFrame of 10 rows and 2 columns). The two most common approaches are column-wise and row-wise.

Pandas is built on top of NumPy, and DataFrame and Series are built on top of NumPy arrays. But note that although a NumPy array is internally stored row-wise in memory, this is NOT necessarily the case for a pandas DataFrame. How a DataFrame is stored depends on how it was initialized, cf this post: https://krbnite.github.io/Memory-Efficient-Windowing-of-Time-Series-Data-in-Python-2-NumPy-Arrays-vs-Pandas-DataFrames/
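A small illustration of this difference (assuming the default block-based layout): a 2-D NumPy array is row-major by default, while a DataFrame keeps each column in its own typed array, so fetching a row across mixed-type columns has to assemble and upcast the values:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
print(arr.flags['C_CONTIGUOUS'])  # True: 2-D NumPy arrays are row-major by default

df = pd.DataFrame({'i': [1, 2, 3], 'f': [0.5, 1.5, 2.5]})

# a column comes straight out of a single internal array, dtype preserved
print(df['i'].dtype)

# a row has to be assembled across differently-typed columns,
# so its values are upcast to a common dtype (float64 here)
print(df.loc[0].dtype)
```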

It's actually quite natural that pandas adopts a column-wise layout most of the time, because pandas was designed to be a data analysis tool that relies more heavily on column-oriented operations than row-oriented operations. cf https://www.stitchdata.com/columnardatabase/

In the end, the answer to the question "Is appending a row to a DataFrame more expensive than appending a column?" also depends on caching, prefetching, etc. Thus it's a rather complicated question to answer and can depend on specific runtime conditions. But the most important factor is the data layout.


Answer from the authors of Pandas

The authors of Pandas actually mentioned this point in their design documentation. cf https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst#what-is-blockmanager-and-why-does-it-exist

So, to do anything row oriented on an all-numeric DataFrame, pandas would concatenate all of the columns together (using numpy.vstack or numpy.hstack) then use array broadcasting or methods like ndarray.sum (combined with np.isnan to mind missing data) to carry out certain operations.

Rafael