Python Pandas Data frame creation

Question

I tried to create a data frame df using the code below:

import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t

print(df)


   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12

When I try to create the same data frame using the syntax below, I get a weird output.

df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])

print(df)

   MUL1  MUL2
0   NaN   NaN
1   NaN   NaN

Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.

Also, please show the correct way to create the same data frame as above while using the columns argument of the pandas DataFrame constructor.

score 6 · Accepted Answer · answered Oct 04 '17 at 10:25

One of the correct ways would be to stack the array data from the input list holding those series into columns -

In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]: 
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12

Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -

In [162]: np.c_[s,t]
Out[162]: 
array([[ 1,  2],
       [ 2,  4],
       [ 3,  6],
       [ 4,  8],
       [ 5, 10],
       [ 6, 12]])

Divakar


Thanks a lot for your answer, sir. I have a minor query, though. The series s and t already look like columns when printed, e.g. `print s` shows the index 0 to 5 next to the values 1 to 6, with dtype: int64. So why do we have to explicitly use np.c_ to convert them to columns? – Sarvagya Dubey Oct 04 '17 at 10:36

@SarvagyaDubey Well, `s` and `t` are pandas Series, and their indexes are most probably what breaks the dataframe creation from just `[s,t]`. Stacking gives us the raw array data and gets rid of those indexes. That lets us build the desired dataframe regardless of their previous index info. – Divakar Oct 04 '17 at 10:43

Hmm. If the inputs are Series, I don't think converting to a NumPy array is a very good approach, because the `index` information is lost. Especially if each Series has a different index, your solution fails. What do you think? – jezrael Oct 04 '17 at 10:45
Your solution works only with default indexes, or with identical indexes if you pass the index explicitly: `pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"], index=s.index)` – jezrael Oct 04 '17 at 10:47
@jezrael I think OP wants to build the output dataframe from the data in `s` and `t` regardless of their index information. I will let OP clarify whether such cases need handling and what the expected output would be. – Divakar Oct 04 '17 at 10:49
Yes, it mainly depends on the OP's data. – jezrael Oct 04 '17 at 10:52
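
The index concern raised in these comments can be illustrated with a short sketch. The non-default index on `t` is a made-up assumption for illustration and is not part of the original question; the names `stacked` and `labelled` are also illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=[0, 1, 2])
# Hypothetical: t carries different index labels than s
t = pd.Series([2, 4, 6], index=[10, 11, 12])

# np.c_ stacks the raw values as columns, so both indexes are discarded
# and the resulting frame gets a fresh default index
stacked = pd.DataFrame(np.c_[s, t], columns=["MUL1", "MUL2"])

# Passing index=s.index restores s's labels, but t's labels are still lost
labelled = pd.DataFrame(np.c_[s, t], columns=["MUL1", "MUL2"], index=s.index)

print(stacked)
print(labelled)
```

The values are paired purely by position here, which is what the OP may want, but it silently ignores any mismatch between the two indexes.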

jezrael · Answer 2 · 2017-10-04T10:37:34.350

5

If you remove the columns argument, you get:

df = pd.DataFrame([s,t])

print (df)
   0  1  2  3   4   5
0  1  2  3  4   5   6
1  2  4  6  8  10  12

Then define the columns. If a requested column does not exist in the data, you get a column of NaNs:

df = pd.DataFrame([s,t], columns=[0,'MUL2'])

print (df)
     0  MUL2
0  1.0   NaN
1  2.0   NaN

It is better to use a dictionary:

df = pd.DataFrame({'MUL1':s,'MUL2':t})

print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12

And if you need to change the column order, add the columns parameter:

df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])

print (df)
   MUL2  MUL1
0     2     1
1     4     2
2     6     3
3     8     4
4    10     5
5    12     6

More information is in dataframe documentation.

Another solution uses concat, so the DataFrame constructor is not necessary:

df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])

print (df)
   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12
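
A sketch of how the dictionary and concat approaches treat the index: both align the Series on their index labels rather than by position. The partially overlapping index on `t` is a made-up example and not from the question.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=[0, 1, 2])
# Hypothetical: t overlaps s only on labels 1 and 2
t = pd.Series([4, 6, 8], index=[1, 2, 3])

by_dict = pd.DataFrame({"MUL1": s, "MUL2": t})
by_concat = pd.concat([s, t], axis=1, keys=["MUL1", "MUL2"])

# The result covers the union of both indexes; labels missing from one
# Series show up as NaN in that Series' column
print(by_dict)
print(by_concat)
```

Because of the NaN entries, both columns are upcast to float in this example.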

edited Oct 04 '17 at 10:37

answered Oct 04 '17 at 10:26

jezrael


I intend to create the data frame without using dictionaries. – Sarvagya Dubey Oct 04 '17 at 10:33
I added another solution where the DataFrame constructor is not necessary. – jezrael Oct 04 '17 at 10:39
Thanks a lot for your help – Sarvagya Dubey Oct 04 '17 at 10:43

@jezrael Thanks man, you're a life saver, so many useful answers:) – theProcrastinator Mar 20 '21 at 10:52

score 1 · Answer 3 · answered May 06 '21 at 10:29

A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:

a = [1,2,3]
b = [2,4,6]

df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])

# output 1:
   Col1  Col2  Col3
0     1     2     3
1     2     4     6

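You are getting NaN because each Series becomes a row, and its index labels [0,1,2,3,4,5] become the column labels. The requested columns ["MUL1","MUL2"] match none of those labels, so they are filled with NaN.
To get the shape you want, first transpose the data: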

data = np.array([a, b]).transpose()
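
A minimal sketch completing the transpose approach, reusing the lists `a` and `b` from above; the column names `Col1` and `Col2` are illustrative.

```python
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 4, 6]

# Stack a and b as rows (shape 2x3), then transpose so they become columns (3x2)
data = np.array([a, b]).transpose()

df = pd.DataFrame(data, columns=["Col1", "Col2"])
print(df)
```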

How to create a pandas dataframe

import pandas as pd

a = [1,2,3]
b = [2,4,6]

df = pd.DataFrame(dict(Col1=a, Col2=b))

Output:

   Col1  Col2
0     1     2
1     2     4
2     3     6

NiKiuS · Answer 4 · 2023-05-09T06:05:05.157

The NaN values appear because [s,t] is treated as 2x6 data: 2 rows (s and t) and 6 columns (the values of each series, labelled 0 to 5). You then asked for 2 columns, ["MUL1","MUL2"], so the output is a 2x2 frame, and since neither name matches any of the 6 existing column labels, every cell is NaN. One way to solve this is to transpose the data so that s and t become columns.

This is how I would write the code for this case:

import numpy as np
import pandas as pd

index = [0,1,2,3,4,5]

columns = ['MUL1', 'MUL2']

s = [1,2,3,4,5,6] 
t = [2,4,6,8,10,12]

df = pd.DataFrame(np.transpose([s,t]), columns = columns, index = index)

print(df)

Output:

   MUL1  MUL2
0     1     2
1     2     4
2     3     6
3     4     8
4     5    10
5     6    12

The same result is obtained by creating the 2x6 array first (called 'rows' here) and then transposing it:

rows = [s,t]

df = pd.DataFrame(np.transpose(rows), columns = columns, index = index)

Python and Libraries version used:

Python 3.11 
NumPy 1.24
Pandas 2.0.1

I know this is an old thread, but I hope this would be useful for someone.
