
I have several series variables I would like to concatenate (along axis=1) to create a DataFrame. I would like the series' names to appear as column names in the DataFrame. I have come across several ways to do this.

The most intuitive approach seems to me to be the following:

import pandas as pd

x1 = pd.Series([1,2,3],name='x1')
x2 = pd.Series([11,12,13],name='x2')
              
df = pd.DataFrame([x1,x2])
print(df)

But rather than make the Series names the column headers, the series data are used as rows in the DataFrame.

     0   1   2
x1   1   2   3
x2  11  12  13

This strikes me as counter-intuitive for two reasons.

  • The data in a Series is likely to be all of one type, e.g. stock prices, time series data, etc. So it seems intuitive that the Series data should be a column, rather than a row, in the DataFrame.

  • When extracting a column as a Series from an existing DataFrame, the column name is used as the name of the Series.

Example:

df = pd.DataFrame({'x1' : [1,2,3], 'x2' : [4,5,6]})
print(type(df['x1']))
print(df['x1'].name)

<class 'pandas.core.series.Series'>
x1

So why isn't the name used as the column header when constructing a DataFrame from a list of Series?

I can always construct a DataFrame from a dictionary to get the result I want:

df = pd.DataFrame({'x1' : x1, 'x2' : x2})
print(df)

   x1  x2
0   1  11
1   2  12
2   3  13

But this strikes me as awkward, since I would have to duplicate the series' names (or at least refer to them in the construction of the dictionary).
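
One way to avoid repeating the names, as a minimal sketch: build the dictionary from each Series' own `name` attribute. The list name `series_list` is only an illustrative choice, and this assumes every Series has a unique name that is not None.

```python
import pandas as pd

x1 = pd.Series([1, 2, 3], name='x1')
x2 = pd.Series([11, 12, 13], name='x2')

# Hypothetical list of named Series; any iterable of them would do.
series_list = [x1, x2]

# Key each Series by its own .name so the names are written only once.
df = pd.DataFrame({s.name: s for s in series_list})
print(df)
```

The names still have to be referred to in the comprehension, but they are no longer typed out a second time.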

On the other hand, the pandas concat function, called with axis=1, does what I would expect as the default behavior:

df = pd.concat([x1,x2],axis=1)
print(df)

   x1  x2
0   1  11
1   2  12
2   3  13

So, my question is, why isn't the behavior I get with concat the default behavior when constructing a DataFrame from a list of series variables?

Donna
  • You should ask the authors of pandas why they decided this, but to me it seems correct. A Series may have labels assigned to its values instead of the numbers 0, 1, 2, e.g. `pd.Series({"X": 1, "Y": 2, "Z": 3}, name='position1')`, so the labels act like "headers", although pandas normally displays them as the index. This way a Series keeps different pieces of information about one object, and a DataFrame keeps objects in rows. BTW: if you use `concat()` with default values, `df = pd.concat([x1,x2])`, then you get a different result. `axis=1` is NOT the default value. – furas May 02 '21 at 22:52
  • Does this mean that a Series can also be viewed as something like a C struct, with a heterogeneous collection of fields? As in `pd.Series({"v" : [1,2,3],"type" : "vector"})`? It never occurred to me that this would work (it does!). I didn't appreciate this mode of use (if in fact it is an intended use). – Donna May 03 '21 at 13:43
  • Does this answer your question? [Pandas: Creating DataFrame from Series](https://stackoverflow.com/questions/23521511/pandas-creating-dataframe-from-series) – Reinderien Sep 26 '21 at 17:36
  • This is a duplicate of - and is missing crucial answers from - https://stackoverflow.com/a/23522030/313768 ; particularly the `concat` approach. – Reinderien Sep 26 '21 at 17:36
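
The two points raised in the comments above can be sketched as follows. This is only an illustration: the second labelled Series, `position2`, and its values are made up for the example.

```python
import pandas as pd

x1 = pd.Series([1, 2, 3], name='x1')
x2 = pd.Series([11, 12, 13], name='x2')

# Default axis=0 stacks the values end to end into one longer Series.
stacked = pd.concat([x1, x2])
print(type(stacked).__name__, stacked.shape)

# axis=1 places each Series side by side as a column named after it.
side_by_side = pd.concat([x1, x2], axis=1)
print(side_by_side.columns.tolist())

# A Series with labelled values reads like one record; a list of such
# records becomes rows, with the labels as column headers.
p1 = pd.Series({"X": 1, "Y": 2, "Z": 3}, name='position1')
p2 = pd.Series({"X": 4, "Y": 5, "Z": 6}, name='position2')  # made-up values
records = pd.DataFrame([p1, p2])
print(records)
```

Seen this way, `pd.DataFrame([x1, x2])` treats each Series as one record, which is why the names end up as row labels.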

1 Answer

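>>> import pandas as pd
>>> x1 = pd.Series([1,2,3],name='x1')
>>> x2 = pd.Series([11,12,13],name='x2')
>>> # Each Series becomes a row, so transpose to turn the rows into columns.
>>> df = pd.DataFrame([x1,x2]).transpose()
>>> df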
   x1  x2
0   1  11
1   2  12
2   3  13

This is because pd.DataFrame treats each Series in the list as one row; it does not zip them into columns for you. You can do the zip yourself:

>>> pd.DataFrame(zip(x1, x2), columns=[x1.name, x2.name])
   x1  x2
0   1  11
1   2  12
2   3  13
Corralien
  • Right - this is another way. I just need to get my head around the fact that a series is not just a fancy "array", but (if I understand the intended use) can also be a collection of heterogeneous fields. – Donna May 03 '21 at 13:47
  • I always think of `transpose` as a very expensive operation, so I typically avoid it. Is transposing a DataFrame a much cheaper operation than a matrix transpose? – Donna May 03 '21 at 13:51
  • I think (but I'm not a numpy expert) the arrays are stored internally in a certain way. Functions like reshape or transpose only return a "view" of the data, so they are not CPU-expensive. – Corralien May 03 '21 at 14:04
  • For an array of shape (10000, 50000), `%timeit a.transpose()` gives 231 ns ± 3.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) – Corralien May 03 '21 at 14:05
  • I just tried `df = pd.DataFrame(a)` where `a` is a (10000,50000) Numpy array. Whereas `a.transpose()` took 199ns/loop, `df.transpose()` took 3.91ms/loop or about 20000 times longer? That can't be right ... – Donna May 03 '21 at 14:38
  • `df.transpose()` is not a simple call to `a.transpose()` or `df.values.transpose()`. You can check [here](https://github.com/pandas-dev/pandas/blob/2cb96529396d93b46abab7bbc73a208e708c642e/pandas/core/frame.py#L2805) the implementation. – Corralien May 03 '21 at 15:34
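
The "view" behaviour discussed in the comments above can be checked on a small array. This is a minimal sketch, not a benchmark: it shows that `ndarray.transpose()` shares memory with the original array, and that `DataFrame.transpose()` returns a new DataFrame with the index and columns swapped. Whether pandas copies the underlying data depends on the dtypes and the pandas version, so no claim is made about that here.

```python
import numpy as np
import pandas as pd

a = np.arange(12).reshape(3, 4)

# NumPy: transpose returns a view, so no data is copied.
t = a.transpose()
print(np.shares_memory(a, t))

# Writing through the view changes the original array: t[0, 1] is a[1, 0].
t[0, 1] = 99
print(a[1, 0])

# pandas: transpose builds a new DataFrame whose index and columns are swapped.
df = pd.DataFrame(a)
dft = df.transpose()
print(df.shape, dft.shape)
```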