7

Setup:

pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])

output:

                         c           d           e
a           b           
0.439502    0.115087     0.832546    0.760513    0.776555
            0.609107     0.247642    0.031650    0.727773
0.995370    0.299640     0.053523    0.565753    0.857235
            0.392132     0.832560    0.774653    0.213692

Each data series is grouped by the index ID `a`, and `b` represents a time index for the other features of each `a` group. Is there a way to get pandas to produce a numpy 3D array that reflects the `a` groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to have the variable-length form:

array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],

       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])

Is there a native pandas way to do this? Note that the number of rows per `a` group in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

o1lo01ol1o
  • Have you looked into groupby? It performs operations on subsets based on a column or index level and mostly obviates the need for N-D arrays where N > 2. – Phillip Cloud May 05 '14 at 17:36
  • Yes, I can use `pdf.groupby(level=1)` and get a `groupby` object that I can manipulate; however, I need to output everything into numpy arrays. This is the last step in a Pandas pipeline and the next needs arrays as described. – o1lo01ol1o May 05 '14 at 19:56

4 Answers

21

I just had an extremely similar problem and solved it like this:

a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

output:

array([[[ 0.47780308,  0.93422319,  0.00526572,  0.41645868,  0.82089215],
        [ 0.47780308,  0.15372096,  0.20948369,  0.76354447,  0.27743855]],

       [[ 0.75146799,  0.39133973,  0.25182206,  0.78088926,  0.30276705],
        [ 0.75146799,  0.42182369,  0.01166461,  0.00936464,  0.53208731]]])

Verifying that it is 3D: a3d.shape gives (2, 2, 5).

Lastly, to make the newly created dimension the last one (instead of the first), use:

a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

which has a shape of (2, 5, 2).

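On newer pandas versions, where as_matrix has been removed, an equivalent one-liner is the following. This is a minimal sketch, assuming every `a` group has the same number of rows:

# Collect each group as a plain array, then stack along a new first axis.
# Assumes all 'a' groups have equal length; np.stack raises otherwise.
a3d = np.stack([grp.to_numpy() for _, grp in pdf.groupby('a')])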

For cases where the data is ragged (as brought up by CharlesG in the comments), you can use something like the following if you want to stick to a numpy solution. Be aware, though, that the best strategy for dealing with missing data varies from case to case. In this example we simply pad the missing rows with zeros.

Example setup with ragged shape:

pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])

dataframe:

                        c           d           e
a           b           
0.460013    0.577535    0.299304    0.617103    0.378887
            0.167907    0.244972    0.615077    0.311497
0.318823    0.640575    0.768187    0.652760    0.822311
            0.424744    0.958405    0.659617    0.998765
            0.077048    0.407182    0.758903    0.273737

One possible solution:

n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
                    .apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))

a3d.shape gives (2, 3, 5).
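
On newer pandas, the same zero-padding idea can be sketched with a plain list comprehension and to_numpy (assuming, as above, that short groups should simply be padded with zero rows):

# Pad every group with zero rows up to the longest group, then stack.
groups = [grp.to_numpy() for _, grp in pdf.groupby('a')]
n_max = max(len(g) for g in groups)
a3d = np.stack([np.pad(g, ((0, n_max - len(g)), (0, 0)), 'constant')
                for g in groups])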

Leo
  • Mentioning that when the numbers of timesteps differ, the resulting shape is just the number of IDs: (2,). We need to pad the other timesteps up to the max number of timesteps (for instance 2 here) to get the right shape (2, 2, 5). I used keras.preprocessing.sequence.pad_sequences(a3d, maxlen=max_timestep) to do so. – CharlesG Mar 02 '20 at 13:29
  • .as_matrix is deprecated and the documentation says to use .values instead – el Josso Oct 27 '22 at 10:05
3

as_matrix is deprecated. Here we assume the first key is a, and the groups in a may have different lengths; this method solves both problems.

import pandas as pd
import numpy as np
from typing import List

def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a DataFrame

    Args:
        df: DataFrame
        idx_cols: columns defining the dimensions of the cube

    Returns:
        multi-dimensional array
    """
    assert set(idx_cols).issubset(df.columns), 'idx_cols must be a subset of columns'

    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    # the levels hold contiguous 0-based integer codes (from ngroup/cumcount),
    # so each dimension is simply the number of distinct codes on that level
    idx_dims = [len(level) for level in df.index.levels]
    idx_dims.append(len(df.columns))

    cube = np.empty(idx_dims)
    cube.fill(np.nan)  # cells with no matching row stay NaN (ragged groups)
    # fancy indexing: each (a, b) index tuple addresses one row of the cube
    cube[tuple(np.array(df.index.to_list()).T)] = df.values

    return cube

Test:


pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']

# a and b must be 0-based integer codes
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 2, 3).


pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']

pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 3, 3).
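
Since the cube is pre-filled with NaN, the rows missing from the shorter group stay NaN, which can be checked like this (a small usage sketch):

cube = make_cube(pdf1, ['a', 'b'])
cube.shape            # (2, 3, 3)
np.isnan(cube).any()  # True: the 2-row group leaves one NaN row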

Mithril
2
panel.values

will return a numpy array directly. This will necessarily use the broadest dtype that can hold all the columns, as everything is smushed into a single 3-D numpy array. It will be a new array, not a view of the pandas data (no matter the dtype). Note that Panel has since been deprecated and removed from pandas, so this only applies to older versions.
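
For context, a minimal sketch of how such a panel could be built from the question's DataFrame; this assumes an old pandas release, before Panel and DataFrame.to_panel were removed:

# Old pandas only: Panel and DataFrame.to_panel() no longer exist.
panel = pdf.set_index(['a', 'b']).to_panel()  # items x major (a) x minor (b)
a3d = panel.values  # a new 3-D numpy array, not a view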

Jeff
0

Instead of the deprecated .as_matrix (or, alternatively, .values), the pandas documentation recommends using .to_numpy():

'Warning: We recommend using DataFrame.to_numpy() instead.'
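
Applied to the first answer above, the one-liner becomes (a sketch, assuming equal-sized groups as in the original setup):

# Same pattern as the top answer, with to_numpy in place of as_matrix.
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)))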

ML-ME