Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?

Question

I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:

data = np.array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])

I'd like the resulting DataFrame to have Row1 and Row2 as index values, and Col1, Col2 as header values.

I can specify the index as follows:

df = pd.DataFrame(data, index=data[:,0])

However, I am unsure how to best assign column headers.

@behzad.nouri's answer is correct, but I think you should consider whether you can have the initial data in another form. As it stands, your values will be strings and not ints: the numpy array mixes ints and strings, and because numpy arrays have to be homogeneous, everything is cast to string. – joris, Dec 24 '13 at 15:54
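
The coercion described in this comment is easy to confirm. A minimal sketch using the array from the question (the two printed checks are only illustrative):

```python
import numpy as np

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])

# NumPy arrays hold a single dtype, so the integers are converted to strings.
print(data.dtype.kind)    # 'U' denotes a Unicode string dtype
print(repr(data[1, 1]))   # the former integer 1 is now stored as a string
```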

score 428 · Accepted Answer · edited Aug 19 '23 at 10:18


Specify data, index and columns to the DataFrame constructor, as follows:

>>> pd.DataFrame(data=data[1:,1:],    # values
...              index=data[1:,0],    # 1st column as index
...              columns=data[0,1:])  # 1st row as the column names

As @joris mentions, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
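
Putting the pieces together, here is a self-contained sketch of the constructor call with an integer conversion applied to the values. It uses `astype(int)` for the conversion, which is a swapped-in alternative to the `np.int_` call mentioned above, and it assumes every value cell holds an integer literal:

```python
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])

df = pd.DataFrame(
    data=data[1:, 1:].astype(int),  # values, converted from strings to integers
    index=data[1:, 0],              # first column as the index
    columns=data[0, 1:],            # first row as the column names
)
print(df)
print(df.dtypes)
```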

answered Dec 24 '13 at 15:50 by behzad.nouri · edited Aug 19 '23 at 10:18 by Mateen Ulhaq


this works - but for such a common structure of input data and desired application to a `DataFrame` is there not some "shortcut"? This is basically the way that `csv`s are loaded - and can be managed by the _default_ handling for many csv readers. An analogous structure for df's would be useful. – WestCoastProjects Nov 17 '18 at 20:26

I added a mini helper/convenience method for this as a supplemental answer. – WestCoastProjects Nov 17 '18 at 21:03

score 174 · Answer 2 · edited Aug 07 '19 at 08:34


Here is an easy-to-understand solution:

>>> import numpy as np
>>> import pandas as pd

# Create a 2-dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])

# Create a pandas dataframe from the array, naming each column explicitly
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
   Column1  Column2
0      5.8      2.8
1      6.0      2.2

answered Jul 12 '18 at 14:28 by Jagannath Banerjee · edited Aug 07 '19 at 08:34 by Jaroslav Bezděk


But you had to manually specify the `Series` names .. that's not scalable. – WestCoastProjects Nov 17 '18 at 20:25
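
One way to address this scalability concern is to pass the whole array to the constructor and generate the labels in a loop. A minimal sketch; the `Column` prefix and 1-based numbering are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

data = np.array([[5.8, 2.8], [6.0, 2.2]])

# Generate one label per array column instead of typing each name by hand.
# The "Column" prefix is an arbitrary choice for this illustration.
names = [f"Column{i + 1}" for i in range(data.shape[1])]
dataset = pd.DataFrame(data, columns=names)
print(dataset)
```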

score 30 · Answer 3 · edited May 23 '17 at 12:26

I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying "option 2" from this great answer, you could do it like this:

import pandas
import numpy

dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

df = pandas.DataFrame(values, index=index)
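
To see why a record array helps, here is a smaller sketch with actual values filled in (the two fields and their values are made up for illustration). Each field keeps its own dtype, and the field names become the column headers:

```python
import numpy as np
import pandas as pd

# Structured (record) array: each field has its own name and dtype.
dtype = [('Col1', 'int32'), ('Col2', 'float32')]
values = np.array([(1, 2.0), (3, 4.0)], dtype=dtype)
index = ['Row' + str(i) for i in range(1, len(values) + 1)]

df = pd.DataFrame(values, index=index)
print(df.dtypes)  # the field dtypes carry over to the columns
```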

score 28 · Answer 4 · edited Aug 25 '21 at 22:03


This can be done simply with the from_records method of the pandas DataFrame:

import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)

answered Oct 07 '18 at 12:31 by Aadil Srivastava · edited Aug 25 '21 at 22:03 by MD Mushfirat Mohaimin

This answer does not work with the example data provided in the question, i.e. `data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])`. – jpp Oct 07 '18 at 12:47

The simplest general solution when we have not specified the labels. – cerebrou Apr 17 '20 at 10:40
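
For labelled data, from_records can also take a structured array and use one of its fields as the index. The following is only a sketch: it assumes the data can be stored as records, and the field name `label` is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical record layout: a 'label' field for the row names
# plus two integer fields for the values.
records = np.array(
    [('Row1', 1, 2), ('Row2', 3, 4)],
    dtype=[('label', 'U4'), ('Col1', 'i8'), ('Col2', 'i8')],
)

# index='label' tells from_records to use that field as the index.
df = pd.DataFrame.from_records(records, index='label')
print(df)
```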

Rahul Verma · Answer 5 · 2019-08-08T07:48:28.010

score 18

    >>> import pandas as pd
    >>> import numpy as np
    >>> data.shape      # data is an existing 2-D array
    (480, 193)
    >>> type(data)
    numpy.ndarray
    >>> df = pd.DataFrame(data=data,
    ...                   index=range(data.shape[0]),
    ...                   columns=['f' + str(i) for i in range(data.shape[1])])
    >>> df.head()

answered Jun 27 '19 at 09:17 by Rahul Verma · edited Aug 08 '19 at 07:48

score 11 · Answer 6 · answered Jul 06 '20 at 18:12

Here is a simple example of creating a pandas dataframe from numpy arrays.

import numpy as np
import pandas as pd

# create two 1-D arrays of length 20
var1 = np.arange(start=1, stop=21, step=1)
var2 = np.random.rand(20)
print(var1.shape)
print(var2.shape)

dataset = pd.DataFrame()
dataset['col1'] = var1
dataset['col2'] = var2
dataset.head()

score 9 · Answer 7 · answered Nov 17 '18 at 21:01

Adding to @behzad.nouri's answer, we can create a helper routine to handle this common scenario:

import numpy as np
import pandas as pd

def csvDf(dat, **kwargs):
    # Treat the first row as column headers and the first column as the index
    data = np.array(dat)
    if data.size == 0:
        return None
    return pd.DataFrame(data[1:, 1:], index=data[1:, 0], columns=data[0, 1:], **kwargs)

Let's try it out:

data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
     ['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
csvDf(data)

In [61]: csvDf(data)
Out[61]:
             a         b         c
row1  row1cola  row1colb  row1colc
row2  row2cola  row2colb  row2colc
row3  row3cola  row3colb  row3colc
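
When the table contains numbers, a variant of this helper can sidestep the string coercion by building an object array first and then letting pandas infer the column dtypes. This is a sketch under that assumption, and `table_to_frame` is a hypothetical name:

```python
import numpy as np
import pandas as pd

def table_to_frame(table):
    """Hypothetical helper: first row -> column names, first column -> index."""
    # dtype=object keeps each cell as its original Python object (no string coercion).
    arr = np.asarray(table, dtype=object)
    df = pd.DataFrame(arr[1:, 1:], index=arr[1:, 0], columns=arr[0, 1:])
    # infer_objects() asks pandas to pick a better dtype for each object column.
    return df.infer_objects()

df = table_to_frame([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])
print(df.dtypes)
```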

score 8 · Answer 8 · answered Jun 25 '20 at 09:23


I think this is a simple and intuitive method:

data = np.array([[0, 0], [0, 1] , [1, 0] , [1, 1]])
reward = np.array([1,0,1,0])

dataset = pd.DataFrame()
dataset['StateAttributes'] = data.tolist()
dataset['reward'] = reward.tolist()

dataset

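This returns a dataframe whose StateAttributes column holds one Python list per row. There are performance implications, though, detailed in the Stack Overflow question "How to set the value of a pandas column as list".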
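
If the list-valued column is not essential, a flat layout with one numeric column per attribute avoids storing Python lists in an object column. A minimal sketch; the column names `state0` and `state1` are made up for illustration:

```python
import numpy as np
import pandas as pd

data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
reward = np.array([1, 0, 1, 0])

# One plain numeric column per state attribute instead of a column of lists.
# The names 'state0' and 'state1' are made up for this illustration.
dataset = pd.DataFrame(data, columns=['state0', 'state1'])
dataset['reward'] = reward
print(dataset)
```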

answered Jun 25 '20 at 09:23 by blue-sky

returns error 'numpy.ndarray' object has no attribute 'toList' – ozmank May 05 '22 at 21:24

score 1 · Answer 9 · answered Jun 25 '20 at 15:24

This may help: build the array, then split the header row from the value rows.

Creating Array

import numpy as np
import pandas as pd

data = np.array([['col1', 'col2'], [4.8, 2.8], [7.0, 1.2]])

>>> data
array([['col1', 'col2'],
       ['4.8', '2.8'],
       ['7.0', '1.2']], dtype='<U32')

Creating data frame

# the first row holds the column names; the remaining rows hold the values (as strings)
df = pd.DataFrame(data[1:], columns=data[0])
df

>>> df
  col1 col2
0  4.8  2.8
1  7.0  1.2

score 1 · Answer 10 · answered Mar 21 '23 at 18:06

1. Dtypes need to be recast

The problem with the original array is that it mixes strings with numbers, so the dtype of the array is either object or str which is not optimal for the dataframe. That can be remedied by calling astype at the end of dataframe construction.

df = pd.DataFrame(data[1:, 1:], index=data[1:, 0], columns=data[0, 1:]).astype(int)

2. Use `read_csv` for convenience

Since data in the OP is almost like a text file read in as a numpy array, one could convert it into a file-like object (using StringIO from the built-in io module) and use pd.read_csv instead. Since read_csv reads the first row as column labels, the only thing that needs to be specified is to read the first column as index. Also, read_csv infers the dtypes, so no need for astype() etc. either.

from io import StringIO
df = pd.read_csv(StringIO('\n'.join([','.join(row) for row in data.tolist()])), index_col=[0])

A convenience wrapper function for the latter case:

from io import StringIO
def read_array(data, index_col=[0], header=0):
    sio = StringIO('\n'.join([','.join(row) for row in data.tolist()]))
    return pd.read_csv(sio, index_col=index_col, header=header)

df = read_array(data)

One advantage of this method shows up with MultiIndex columns or indices: constructing the dataframe correctly with pd.DataFrame takes some manual work, whereas read_array() makes it easy because read_csv handles that internally. For example, for the following data, just specify which rows should be read in as headers:

data = np.array([['', 'Col0', 'Col0'], ['', 'Col1', 'Col2'], ['Row1', 1, 2],['Row2', 3, 4]])

df = read_array(data, header=[0,1])

# to produce the equivalent with pd.DataFrame, pd.MultiIndex object needs to be constructed
df = pd.DataFrame(data[2:, 1:], index=data[2:, 0], columns=pd.MultiIndex.from_arrays(data[:2, 1:])).astype(int)
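
One caveat for the read_csv route: joining cells with ',' assumes that no cell contains a comma. A hedged alternative is to let the standard csv module write the text, since it quotes such fields. The table below is illustrative, with one row label deliberately containing a comma:

```python
import csv
from io import StringIO

import pandas as pd

# Illustrative table in which one row label contains a comma.
rows = [['', 'Col1', 'Col2'], ['Row, 1', 1, 2], ['Row2', 3, 4]]

buf = StringIO()
csv.writer(buf).writerows(rows)  # quotes fields that contain the delimiter
buf.seek(0)                      # rewind so read_csv starts at the beginning

df = pd.read_csv(buf, index_col=0)
print(df)
```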

3. Cast numpy arrays to dataframe

This covers different cases than the one in the OP, but in general a numpy array can be cast directly into a pandas dataframe. If custom string column labels are needed, just call add_prefix(). For example,

arr = np.arange(9).reshape(-1,3)
df = pd.DataFrame(arr).add_prefix('Col')
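
As a small extension of this idea, row labels can be generated in a similar way with rename(). This is only a sketch, and the `Row` naming scheme is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

arr = np.arange(9).reshape(-1, 3)

# add_prefix turns the default integer column labels 0, 1, 2 into strings.
df = pd.DataFrame(arr).add_prefix('Col')
# Row labels can be derived from the default integer index as well;
# the 'Row' naming is only an example.
df = df.rename(index=lambda i: f'Row{i + 1}')
print(df)
```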
