
I want to create a DataFrame from a dictionary where the values are 2D numpy arrays.

my_Dict = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
           'b': np.array([[7, 8, 9], [10, 11, 12]]),
           'c': np.array([[13, 14, 15], [16, 17, 18]])}

I expect the outcome to be a DataFrame with 2 rows (the number of rows in each numpy array) and 3 columns, as below:

           a             b             c
0  [1, 2, 3]     [7, 8, 9]  [13, 14, 15]
1  [4, 5, 6]  [10, 11, 12]  [16, 17, 18]

I tried converting the values to lists and it worked, but I want to keep the values as numpy arrays so I can apply numpy functions to them.

bahar
  • Just wondering, would all values in a column be of same length? (Because if yes, you'll be a lot better off saving them as 3 columns instead of 1, and still be able to use all numpy operations on the underlying arrays) – Paritosh Singh Apr 28 '19 at 13:36
  • Thanks for your comment. I want to merge this DataFrame later with another one, and the columns represent the values of different attributes of some outcomes; that's why it is important for me that each column refers to a single attribute. – bahar Apr 28 '19 at 13:56
  • in that case, let me write up a suggestion for you to use here. – Paritosh Singh Apr 28 '19 at 13:58

2 Answers

>>> list(np.array([[1, 2, 3],[4, 5, 6]]))
[array([1, 2, 3]), array([4, 5, 6])]
>>>

Transform each column's 2-d array into a list of two 1-d arrays:

import numpy as np
import pandas as pd

d = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
     'b': np.array([[7, 8, 9], [10, 11, 12]]),
     'c': np.array([[13, 14, 15], [16, 17, 18]])}

df = pd.DataFrame({k: list(v) for k, v in d.items()})

>>> df
           a             b             c
0  [1, 2, 3]     [7, 8, 9]  [13, 14, 15]
1  [4, 5, 6]  [10, 11, 12]  [16, 17, 18]
>>> 

>>> df.loc[0,'a']
array([1, 2, 3])
>>> df['a'].values
array([array([1, 2, 3]), array([4, 5, 6])], dtype=object)
>>> df.values
array([[array([1, 2, 3]), array([7, 8, 9]), array([13, 14, 15])],
       [array([4, 5, 6]), array([10, 11, 12]), array([16, 17, 18])]],
      dtype=object)
>>>
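Since the stated goal is to keep applying numpy functions to the values, a minimal sketch (reusing the dictionary from above) of how that works with this object-column layout: each cell is still a 1-d ndarray, and the original 2-d array can be rebuilt when needed.

```python
import numpy as np
import pandas as pd

d = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
     'b': np.array([[7, 8, 9], [10, 11, 12]]),
     'c': np.array([[13, 14, 15], [16, 17, 18]])}
df = pd.DataFrame({k: list(v) for k, v in d.items()})

# Each cell is still a 1-d ndarray, so numpy functions apply per cell.
sums = df['a'].map(np.sum)           # 0 -> 6, 1 -> 15

# The full 2-d array can be recovered by stacking the column's cells.
restored = np.stack(df['a'].values)  # shape (2, 3), integer dtype
```

Note that `map` loops in Python over the cells, which is the cost of the object dtype discussed in the comments below.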
wwii

Considering why you'd want to do this in the first place, I would instead recommend making a multi-level DataFrame.

Given:

import numpy as np
import pandas as pd

myDict = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
          'b': np.array([[7, 8, 9], [10, 11, 12]]),
          'c': np.array([[13, 14, 15], [16, 17, 18]])}

Turn each array into an individual DataFrame, then concat to get a two-level DataFrame:

df = pd.concat([pd.DataFrame(v) for k, v in myDict.items()],
               axis=1, keys=list(myDict.keys()))

print(df)
   a         b           c        
   0  1  2   0   1   2   0   1   2
0  1  2  3   7   8   9  13  14  15
1  4  5  6  10  11  12  16  17  18

This keeps the internal structure of the DataFrame as plain numpy arrays instead of boxed Python objects, which helps the speed of some operations: with an object dtype, pandas falls back to iteration during column operations.
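A quick sketch contrasting the two layouts makes the dtype difference visible (column names reused from above):

```python
import numpy as np
import pandas as pd

myDict = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
          'b': np.array([[7, 8, 9], [10, 11, 12]]),
          'c': np.array([[13, 14, 15], [16, 17, 18]])}

# Multi-level frame: one numeric column per array column.
df = pd.concat([pd.DataFrame(v) for k, v in myDict.items()],
               axis=1, keys=list(myDict.keys()))
print(df.dtypes.unique())      # integer dtype throughout, no 'object'

# Object-column frame for comparison: every cell is a boxed ndarray.
df_obj = pd.DataFrame({k: list(v) for k, v in myDict.items()})
print(df_obj.dtypes.unique())  # object dtype only
```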

You can index normally still:

print(df['a'])
   0  1  2
0  1  2  3
1  4  5  6

And also do operations on the underlying numpy arrays, either directly or using .values:

df['a'] = df['a'].values * 10

print(df)
    a           b           c        
    0   1   2   0   1   2   0   1   2
0  10  20  30   7   8   9  13  14  15
1  40  50  60  10  11  12  16  17  18
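Because the whole frame is a single numeric block, numpy ufuncs and reductions also vectorize across all columns at once, with no per-cell iteration; a short sketch of that:

```python
import numpy as np
import pandas as pd

myDict = {'a': np.array([[1, 2, 3], [4, 5, 6]]),
          'b': np.array([[7, 8, 9], [10, 11, 12]]),
          'c': np.array([[13, 14, 15], [16, 17, 18]])}
df = pd.concat([pd.DataFrame(v) for k, v in myDict.items()],
               axis=1, keys=list(myDict.keys()))

# Ufuncs operate on the whole numeric block and keep the MultiIndex.
doubled = np.multiply(df, 2)   # still a DataFrame

# Reductions across all nine numeric columns are vectorized too.
row_sums = df.sum(axis=1)      # 0 -> 72, 1 -> 99
```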
Paritosh Singh
  • `...internal structures of the dataframe to be numpy arrays instead of dealing with objects.` - why is that advantageous? – wwii Apr 28 '19 at 14:41
  • Am I mistaken in saying that? – Paritosh Singh Apr 28 '19 at 14:50
  • ?? I was asking - I don't use MultiIndexed/hierarchical DataFrames/Series and don't have a good understanding. Intuitively I think there is an advantage over my solution that produces a DataFrame of objects. – wwii Apr 28 '19 at 15:01
  • Ah, I am afraid I do not know the specifics as to why it's better, sorry, but I have been informed in the past that objects hinder a lot of operations, making some things default to iteration during column operations, and take up more space to store since the DataFrame cannot infer the data types. I do not know how well that generalizes to objects that are numpy arrays, however, and my dummy timeit operations seem to be equal in speed. – Paritosh Singh Apr 28 '19 at 15:22
  • 1
    Ok yep, coldspeed confirmed it [here](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care) Partial Quote: "and all operations on objects fall back to a slow, loopy implementation." – Paritosh Singh Apr 29 '19 at 05:50
  • 2
    That was Good - thnx coldspeed :). Even without knowing *what* operations the OP will be doing, I suspect that **at best** my solution would have the same performance and most probably would be worse than operations with your solution. – wwii Apr 29 '19 at 15:40