easy multidimensional numpy ndarray to pandas dataframe method?

Question

Having a 4-D numpy.ndarray, e.g.

myarr = np.random.rand(10, 4, 3, 2)
dims = {'time': range(1, 11), 'sub': range(1, 5), 'cond': ['A', 'B', 'C'], 'measure': ['meas1', 'meas2']}

This is only an example; the real arrays may have more dimensions. How can I create a pandas.DataFrame with a MultiIndex by just passing the dimensions as indexes, without further manual adjustments (such as reshaping the ndarray into a 2-D shape)?

I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.

What would be a function to which to pass the column/row indexes and create a dataframe? Something like:

df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])

And end up with something like:

              meas1             meas2
              A     B     C     A    B    C
sub   time
  1      1
         2
         3
         .
         .
  2      1
         2
 ...

If it is not possible or feasible to automate this, an explanation that is less terse than the MultiIndex documentation would be appreciated.

I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])



pd.DataFrame(a.reshape(2*3*1,2*2),index)

gives:

ValueError: Shape of passed values is (4, 6), indices imply (4, 24)

score 5 · Accepted Answer · answered Apr 26 '16 at 05:15

You're getting the error because you've reshaped the ndarray to 6x4 but are applying an index with 24 entries, one for every element of the array. That index only fits if all the values sit in a single column. The following setup gets the toy example working:

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])

pd.DataFrame(a.reshape(24, 1),index=index)
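As a quick sanity check (a sketch added for illustration, not part of the original answer): MultiIndex.from_product enumerates the label combinations with the last level varying fastest, which is the same order in which a C-ordered array is flattened. The name s and the labels picked for the lookup are just assumptions for the example.

```python
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])

# Flatten in C order so every element gets exactly one index entry
s = pd.Series(a.ravel(), index=index)

# Label-based lookup should agree with positional lookup:
# time=2 -> 1, sub=1 -> 0, meas='m2' -> 1, cond='A' -> 0
print(s.loc[(2, 1, 'm2', 'A')], a[1, 0, 1, 0])
```

If the two printed values match, the labels are attached to the right elements.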

Solution

Here's a generic DataFrame creator that should get the job done:

def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)

Demonstration

Without named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])

       1         2     
       3    4    3    4
a c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN
b c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN

With named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])

number1          1         2     
number2          3    4    3    4
alpha1 alpha2                    
a      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
b      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
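Note that produce_df above only builds an empty frame. A possible way to also fill it from the array, along the lines of the nd2df call in the question, is to move the row axes to the front with transpose and then reshape to 2-D. This is only a sketch: nd2df, its signature, and the labels/names lists are hypothetical and not an existing pandas function.

```python
import numpy as np
import pandas as pd

def nd2df(arr, labels, names, dim2row, dim2col):
    """Hypothetical helper: put the axes in dim2row on the rows
    and the axes in dim2col on the columns of a DataFrame."""
    arr = np.asarray(arr)
    order = list(dim2row) + list(dim2col)
    if sorted(order) != list(range(arr.ndim)):
        raise ValueError("dim2row and dim2col must list every axis exactly once")
    row_index = pd.MultiIndex.from_product(
        [labels[d] for d in dim2row], names=[names[d] for d in dim2row])
    col_index = pd.MultiIndex.from_product(
        [labels[d] for d in dim2col], names=[names[d] for d in dim2col])
    # Row axes first, column axes last, then collapse each group into one axis
    data = arr.transpose(order).reshape(len(row_index), len(col_index))
    return pd.DataFrame(data, index=row_index, columns=col_index)

# Axis order of myarr: time, sub, cond, measure (as in the question)
myarr = np.random.rand(10, 4, 3, 2)
labels = [list(range(1, 11)), list(range(1, 5)), ['A', 'B', 'C'], ['meas1', 'meas2']]
names = ['time', 'sub', 'cond', 'measure']

# Rows: sub (outer), time (inner); columns: measure (outer), cond (inner)
df = nd2df(myarr, labels, names, dim2row=[1, 0], dim2col=[3, 2])
print(df.head())
```

The transpose is what spares you from thinking about the reshape by hand: once the axes are in row-then-column order, a plain C-order reshape matches the order produced by from_product.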

Now I see the mistake about the extra dimension, thanks. nifty little function! — TNT, Apr 26 '16 at 07:02

B. M. · Answer 2 · 2016-04-26T08:34:29.837

score 2

From the structure of your data,

names=['sub','time','measure','cond']  #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]

A straightforward way to your goal:

index = pd.MultiIndex.from_product(labels,names=names)
data = np.arange(index.size)  # or myarr.flatten(), if myarr's axes are in the same order as names

df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])


"""
measure  meas1         meas2        
cond         A   B   C     A   B   C
sub time                            
1   1        0   1   2     3   4   5
    2        6   7   8     9  10  11
2   1       12  13  14    15  16  17
    2       18  19  20    21  22  23
3   1       24  25  26    27  28  29
    2       30  31  32    33  34  35

"""
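For comparison, here is a sketch of the same reshaping with Series.unstack instead of pivot_table. It is an alternative I am assuming fits here, not what the answer above used; the difference is that unstack only rearranges the values, while pivot_table aggregates them (with the mean by default). The names s and wide are just for illustration.

```python
import numpy as np
import pandas as pd

names = ['sub', 'time', 'measure', 'cond']
labels = [[1, 2, 3], [1, 2], ['meas1', 'meas2'], list('ABC')]
index = pd.MultiIndex.from_product(labels, names=names)

# One long Series with a four-level row index
s = pd.Series(np.arange(index.size), index=index)

# Move the last two levels from the rows into the columns
wide = s.unstack(['measure', 'cond'])
print(wide)
```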

edited Apr 26 '16 at 08:34

answered Apr 26 '16 at 05:35

B. M.


still a little terse and off from the concrete problem, but also helpful, thanks – TNT Apr 26 '16 at 07:07

I have adapted for a more useful and clear (?) method. – B. M. Apr 26 '16 at 08:37
Cool, didn't know about the pivot_table method! – marcotama Oct 02 '18 at 05:20

score 0 · Answer 3 · answered Apr 26 '16 at 03:49

I still don't know how to do it directly, but here is an easy-to-follow, step-by-step way:

import numpy as np
import pandas as pd

# Create 4D-array
a = np.arange(24).reshape((3, 2, 2, 2))
# Set only one row index
rowiter = [[1, 2, 3]]
row_ind = pd.MultiIndex.from_product(rowiter, names=['time'])
# Put the rest of the dimensions into columns
coliter = [[1, 2], ['m1', 'm2'], ['A', 'B']]
col_ind = pd.MultiIndex.from_product(coliter, names=['sub', 'meas', 'cond'])
ncols = len(col_ind)  # product of the column level sizes
b = pd.DataFrame(a.reshape(len(rowiter[0]), ncols), index=row_ind, columns=col_ind)
print(b)
# Move column levels into the rows as desired:
b = b.stack('sub')
# Swap the row levels and sort the rows (level 0 is the outermost level).
# sort_index replaces sortlevel, which was removed in later pandas versions.
c = b.swaplevel(0, 1, axis=0).sort_index(level=0, axis=0)

To check the correct assignment of dimensions:

print(a[:,0,0,0])
[ 0  8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]

print(b)
meas      m1      m2    
cond       A   B   A   B
time sub                
1    1     0   1   2   3
     2     4   5   6   7
2    1     8   9  10  11
     2    12  13  14  15
3    1    16  17  18  19
     2    20  21  22  23

print(c)
meas      m1      m2    
cond       A   B   A   B
sub time                
1   1      0   1   2   3
    2      8   9  10  11
    3     16  17  18  19
2   1      4   5   6   7
    2     12  13  14  15
    3     20  21  22  23
