The parameters section of the documentation for DataFrame (as of pandas 2.0.0) begins:

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame

Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.

If data is a list of dicts, column order follows insertion-order.

The description lists the valid input types (i.e., ndarray, Iterable, dict, or DataFrame) but does not fully describe how the constructor turns the data into a DataFrame. It reads like something of a black box. Should I be able to predict, from the documentation alone, that passing a list containing a single Series and no other arguments gives a result that looks like Series.to_frame().T (although the dtypes may differ; see this answer and this one)?
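For concreteness, the comparison above can be sketched as follows. The Series contents, labels, and name are made up for illustration; the snippet only compares the labels of the two results and makes no claim about dtypes, which the linked answers say may differ.

```python
import pandas as pd

# Hypothetical example Series; values, labels, and name are illustrative only.
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="x")

from_list = pd.DataFrame([s])   # a list containing a single Series
transposed = s.to_frame().T     # one-column frame, then transposed

# In both results the Series' index labels end up on the columns
# and the Series' name ends up as the single row label.
print(from_list)
print(transposed)
print(list(from_list.columns) == list(transposed.columns))
print(list(from_list.index) == list(transposed.index))
```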

The purpose of this question is to solicit answers that classify the different ways of passing data to DataFrame() via data, according to how the constructor puts or massages the data into the DataFrame. It is necessarily a broad question, but there should be a finite number of cases given that the constructor is, you know, implemented in code. I'm interested in this question and would be willing to dig through the source code a little to find the answer; however, I think others with more experience may have insights to share before I do that.

This is a single question about rules broadly, and I believe its answers belong together in one place. However, since it is broad, I will provide some specific sub-questions to get us started:

  • For iterables, what container and element combinations are valid? Without needing to try it, should I be able to predict what will happen if I pass a list of DataFrames or a Series of Series? Which axis is used when a Series input is "aligned by its index"? Does the treatment depend at all on what its elements are?

  • How do the container and element types passed via data affect how the DataFrame will be put together? Should I be able to predict how the data will be aligned along the axes of the resulting DataFrame based on knowledge of data alone? I don't know if the answer is obvious, but in either case I do not see it documented.

  • If I think of a DataFrame as "a dict-like container for Series objects" (as the docs suggest), what are the intuitive rules governing how data gets interpreted (loosely) into keys and values?
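To make the alignment sub-question concrete, here is a small sketch with two made-up Series whose indexes only partly overlap. It shows the same pair passed once inside a dict and once inside a list; the variable names and values are assumptions chosen for illustration.

```python
import math
import pandas as pd

# Two hypothetical Series with partly overlapping index labels.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([3, 4], index=["y", "z"])

# Dict of Series: each Series becomes a column, and the Series' index
# labels are aligned along the rows (missing entries are filled with NaN).
as_columns = pd.DataFrame({"a": a, "b": b})

# List of Series: each Series becomes a row, and the Series' index
# labels are aligned along the columns instead.
as_rows = pd.DataFrame([a, b])

print(as_columns)
print(as_rows)
```

In both cases the labels of the Series end up on the axis opposite to the one the Series themselves are laid out along.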

I'm open to suggestions for improving the question, but I do think it's a question that needs to be asked and I did not find a similar question on this site.

Attila the Fun

2 Answers

As a rule of thumb:

  • If data is list-like, its items become rows.
  • If data is dict-like, its items become columns (with keys as column names).

Then for items within data:

  • If the items are dict-like, the item keys become the labels of the other axis (i.e., columns if data was list-like, index if data was dict-like).

Finally, a Series is dict-like, mapping index labels to values; a DataFrame is dict-like, mapping column names to Series.
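The rules of thumb above can be checked with a short sketch. All data and names below are made up for illustration. The last case, a bare Series passed directly as data, is included because it is not obviously covered by the rules: it comes out as a single column rather than being unpacked like a dict.

```python
import pandas as pd

# List-like data: each item becomes a row.
from_lists = pd.DataFrame([[1, 2], [3, 4]])

# Dict-like data: each item becomes a column, keyed by the dict key.
from_dict = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# List of dicts: items are rows, item keys become column names.
from_list_of_dicts = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])

# Dict of dicts: items are columns, inner keys become the index.
from_dict_of_dicts = pd.DataFrame({"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}})

# A bare Series as data: becomes one column named after the Series.
from_series = pd.DataFrame(pd.Series([1, 2], index=["x", "y"], name="s"))

for frame in (from_lists, from_dict, from_list_of_dicts,
              from_dict_of_dicts, from_series):
    print(frame, end="\n\n")
```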

Attila the Fun

Besides the documentation, it's sometimes useful to read the tests, especially test_constructors.py in your case. There are many ways to build a DataFrame, too many to describe them all here, so take a look at that file.

Corralien