I have a pandas column storing a np array in each row. The df looks like this:

0    [38, 324, -21]
1    [41, 325, -19]
2    [41, 325, -19]
3    [42, 326, -20]
4    [42, 326, -19]

I want to convert this column into a np array so I can use it as training data for a model. I convert it to one np array with this:

arr = df.c.values

Now, I would expect the shape of this array to be (5,3). However, when I run:

arr.shape

I get this:

(5,)

Further, if I run:

arr[0].shape

I get (3,).

Why don't I just get shape (5,3) when I run arr.shape?

desertnaut
connor449
  • `arr = np.array(df.c.to_list())`. – Quang Hoang Dec 08 '21 at 21:38
  • It is object dtype: 5 separate arrays, not one 2D array. That's how they are stored in the frame. – hpaulj Dec 08 '21 at 21:39
  • Since there's no guarantee that all elements in the column have the same length, `.values` will not return a 2D array for you. However, you can manually construct an array like @QuangHoang commented. – Psidom Dec 08 '21 at 21:39
  • @QuangHoang Ahh, you are right. I forgot that `df.c.values` is already a numpy array. – Psidom Dec 08 '21 at 21:43
  • Maybe `df[['c']].values` –  Dec 08 '21 at 21:44
  • I think this question is actually answered, and is about [how to convert a pandas table to a numpy array](https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array/). – D A Dec 08 '21 at 23:04
  • Does this answer your question? [Convert pandas dataframe to NumPy array](https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array) – D A Dec 08 '21 at 23:05
  • I think the thing with OP here is that each row of column `"c"` is a numpy array. So neither `df.c.values` nor `df.c.to_numpy()` will give the desired result. – Andre Dec 09 '21 at 09:11

1 Answer

You can see what df.c.values actually is by inspecting its output:

import numpy as np
import pandas as pd

df = pd.DataFrame()
# Bounds chosen so that negative and two-digit values like those shown below can occur
df['c'] = [np.random.randint(-100, 100, 3) for _ in range(5)]
In [2]: df
Out[2]:
    c
0   [-80, 4, -84]
1   [88, 32, 85]
2   [-11, 71, 37]
3   [-78, 93, 50]
4   [30, 29, 28]
In [3]: df.c.values
Out[3]: 
array([array([-80,   4, -84]), array([88, 32, 85]),
       array([-11,  71,  37]), array([-78,  93,  50]),
       array([30, 29, 28])], dtype=object)

So df.c.values is a 1-dimensional object array containing 5 individual arrays (hence df.c.values.shape == (5,)), not a 2D array.
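
A minimal sketch of why pandas cannot promise a 2D result here. The Series names `uniform` and `ragged` are made up for illustration; the point is that a column of arrays is stored as generic Python objects, so rows of different lengths are just as legal as rows of equal length, and in both cases the outer array is one-dimensional:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one Series with equal-length rows, one without.
uniform = pd.Series([np.array([38, 324, -21]), np.array([41, 325, -19])])
ragged = pd.Series([np.array([1, 2, 3]), np.array([4, 5])])

for name, series in (("uniform", uniform), ("ragged", ragged)):
    values = series.values
    # The outer array holds references to the inner arrays (dtype=object),
    # so its shape only counts the rows.
    print(name, values.dtype, values.shape, [v.shape for v in values])
```

Only the inner arrays know their own length, which is why arr[0].shape is (3,) while arr.shape is (5,).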

To get a 2D array, you need to combine (stack) them into a single array. A straightforward way is to np.vstack() them:

arr = np.vstack(df.c.values)
arr.shape == (5,3)
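
For comparison, here is a small self-contained sketch using the values from the question. It puts np.vstack() next to the `np.array(df.c.to_list())` approach suggested in the comments; both should give the same (5, 3) array as long as every row has the same length. The variable names `rows`, `raw`, `stacked`, and `from_list` are chosen only for this illustration:

```python
import numpy as np
import pandas as pd

# Rebuild the column from the question: one small array per row.
rows = [
    np.array([38, 324, -21]),
    np.array([41, 325, -19]),
    np.array([41, 325, -19]),
    np.array([42, 326, -20]),
    np.array([42, 326, -19]),
]
df = pd.DataFrame()
df["c"] = rows

raw = df["c"].values                      # 1-D object array of 5 arrays
stacked = np.vstack(raw)                  # stack the rows into one 2-D array
from_list = np.array(df["c"].to_list())   # alternative from the comments

print(raw.shape, stacked.shape, from_list.shape)
```

If the rows had different lengths, np.vstack() would raise a ValueError, so either approach assumes uniform row lengths.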
Andre