7

Setup:

pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])

output:

                         c           d           e
a           b           
0.439502    0.115087     0.832546    0.760513    0.776555
            0.609107     0.247642    0.031650    0.727773
0.995370    0.299640     0.053523    0.565753    0.857235
            0.392132     0.832560    0.774653    0.213692

Each data series is grouped by the index ID `a`, and `b` represents a time index for the other features of each `a` group. Is there a way to get pandas to produce a numpy 3D array that reflects the `a` groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to have the variable-length form:

array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],

       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])

Is there a native pandas way to do this? Note that the number of rows per `a` group in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

o1lo01ol1o
  • Have you looked into groupby? It performs operations on subsets based on a column or index level and mostly obviates the need for N-D arrays where N > 2. – Phillip Cloud May 05 '14 at 17:36
  • Yes, I can use `pdf.groupby(level=1)` and get a `groupby` object that I can manipulate; however, I need to output everything into numpy arrays. This is the last step in a Pandas pipeline and the next needs arrays as described. – o1lo01ol1o May 05 '14 at 19:56

4 Answers

21

I just had an extremely similar problem and solved it like this:

a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

output:

array([[[ 0.47780308,  0.93422319,  0.00526572,  0.41645868,  0.82089215],
        [ 0.47780308,  0.15372096,  0.20948369,  0.76354447,  0.27743855]],

       [[ 0.75146799,  0.39133973,  0.25182206,  0.78088926,  0.30276705],
        [ 0.75146799,  0.42182369,  0.01166461,  0.00936464,  0.53208731]]])

Verifying that it is 3D: a3d.shape gives (2, 2, 5).

Lastly, to make the newly created dimension the last one (instead of the first), use:

a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

which has a shape of (2, 5, 2).

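On newer pandas versions, where as_matrix has been removed, an equivalent one-liner is the following. This is a minimal sketch, assuming every `a` group has the same number of rows:

# Collect each group as a plain array, then stack along a new first axis.
# Assumes all 'a' groups have equal length; np.stack raises otherwise.
a3d = np.stack([grp.to_numpy() for _, grp in pdf.groupby('a')])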

For cases where the data is ragged (as brought up by CharlesG in the comments), you can use something like the following if you want to stick to a numpy solution. Be aware, though, that the best strategy for dealing with missing data varies from case to case. In this example we simply pad the missing rows with zeros.

Example setup with ragged shape:

pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])

dataframe:

                        c           d           e
a           b           
0.460013    0.577535    0.299304    0.617103    0.378887
            0.167907    0.244972    0.615077    0.311497
0.318823    0.640575    0.768187    0.652760    0.822311
            0.424744    0.958405    0.659617    0.998765
            0.077048    0.407182    0.758903    0.273737

One possible solution:

n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
                    .apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))

a3d.shape gives (2, 3, 5).
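
On newer pandas, the same zero-padding idea can be sketched with a plain list comprehension and to_numpy (assuming, as above, that short groups should simply be padded with zero rows):

# Pad every group with zero rows up to the longest group, then stack.
groups = [grp.to_numpy() for _, grp in pdf.groupby('a')]
n_max = max(len(g) for g in groups)
a3d = np.stack([np.pad(g, ((0, n_max - len(g)), (0, 0)), 'constant')
                for g in groups])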

Leo
  • Mentioning that when the numbers of timesteps differ, the resulting shape is just the number of IDs: (2,). We need to pad the other timesteps up to the max number of timesteps (for instance 2 here) to get the right shape (2, 2, 5). I used keras.preprocessing.sequence.pad_sequences(a3d, maxlen=max_timestep) to do so. – CharlesG Mar 02 '20 at 13:29
  • .as_matrix is deprecated and the documentation says to use .values instead – el Josso Oct 27 '22 at 10:05
3

as_matrix is deprecated. Here we assume the first key is a, and the groups in a may have different lengths; this method solves both problems.

import pandas as pd
import numpy as np
from typing import List

def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a DataFrame

    Args:
        df: DataFrame
        idx_cols: columns defining the dimensions of the cube

    Returns:
        multi-dimensional array
    """
    assert set(idx_cols).issubset(df.columns), 'idx_cols must be a subset of columns'

    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    # the levels hold contiguous 0-based integer codes (from ngroup/cumcount),
    # so each dimension is simply the number of distinct codes on that level
    idx_dims = [len(level) for level in df.index.levels]
    idx_dims.append(len(df.columns))

    cube = np.empty(idx_dims)
    cube.fill(np.nan)  # cells with no matching row stay NaN (ragged groups)
    # fancy indexing: each (a, b) index tuple addresses one row of the cube
    cube[tuple(np.array(df.index.to_list()).T)] = df.values

    return cube

Test:


pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']

# a and b must be 0-based integer codes
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 2, 3).


pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']

pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 3, 3).
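
Since the cube is pre-filled with NaN, the rows missing from the shorter group stay NaN, which can be checked like this (a small usage sketch):

cube = make_cube(pdf1, ['a', 'b'])
cube.shape            # (2, 3, 3)
np.isnan(cube).any()  # True: the 2-row group leaves one NaN row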

Mithril
2
panel.values

will return a numpy array directly. This will necessarily use the broadest dtype that can hold all the columns, as everything is smushed into a single 3-D numpy array. It will be a new array, not a view of the pandas data (no matter the dtype). Note that Panel has since been deprecated and removed from pandas, so this only applies to older versions.
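
For context, a minimal sketch of how such a panel could be built from the question's DataFrame; this assumes an old pandas release, before Panel and DataFrame.to_panel were removed:

# Old pandas only: Panel and DataFrame.to_panel() no longer exist.
panel = pdf.set_index(['a', 'b']).to_panel()  # items x major (a) x minor (b)
a3d = panel.values  # a new 3-D numpy array, not a view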

Jeff
0

Instead of the deprecated .as_matrix (or, alternatively, .values), the pandas documentation recommends using .to_numpy():

'Warning: We recommend using DataFrame.to_numpy() instead.'
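
Applied to the first answer above, the one-liner becomes (a sketch, assuming equal-sized groups as in the original setup):

# Same pattern as the top answer, with to_numpy in place of as_matrix.
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)))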

ML-ME