40

I'm having difficulty constructing a 3D DataFrame in Pandas. I want something like this

A               B               C
start    end    start    end    start    end ...
7        20     42       52     90       101
11       21                     213      34
56       74                     9        45
45       12

Where A, B, etc are the top-level descriptors and start and end are subdescriptors. The numbers that follow are in pairs and there aren't the same number of pairs for A, B etc. Observe that A has four such pairs, B has only 1, and C has 3.

I'm not sure how to proceed in constructing this DataFrame. Modifying this example didn't give me the designed output:

import numpy as np
import pandas as pd

A = np.array(['one', 'one', 'two', 'two', 'three', 'three'])
B = np.array(['start', 'end']*3)
C = [np.random.randint(10, 99, 6)]*6
df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])
df.set_index(['A', 'B'], inplace=True)
df

yielded:

                C
 A          B   
 one        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 two        start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]
 three      start   [22, 19, 16, 20, 63, 54]
              end   [22, 19, 16, 20, 63, 54]

Is there any way of breaking up the lists in C into their own columns?

EDIT: The structure of my C is important. It looks like the following:

 C = [[7,11,56,45], [20,21,74,12], [42], [52], [90,213,9], [101, 34, 45]]

And the desired output is the one at the top. It represents the starting and ending points of subsequences within a certain sequence (A, B. C are the different sequences). Depending on the sequence itself, there are a differing number of subsequences that satisfy a given condition I'm looking for. As a result, there are a differing number of start:end pairs for A, B, etc

Community
  • 1
  • 1
tlnagy
  • 3,274
  • 4
  • 24
  • 37

3 Answers3

17

First, I think you need to fill C to represent missing values

In [341]: max_len = max(len(sublist) for sublist in C)
In [344]: for sublist in C:
     ...:     sublist.extend([np.nan] * (max_len - len(sublist)))

In [345]: C
Out[345]: 
[[7, 11, 56, 45],
 [20, 21, 74, 12],
 [42, nan, nan, nan],
 [52, nan, nan, nan],
 [90, 213, 9, nan],
 [101, 34, 45, nan]]

Then, convert to a numpy array, transpose, and pass to the DataFrame constructor along with the columns.

In [288]: C = np.array(C)
In [289]: df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A,B)))

In [349]: df
Out[349]: 
     one         two       three     
   start  end  start  end  start  end
0      7   20     42   52     90  101
1     11   21    NaN  NaN    213   34
2     56   74    NaN  NaN      9   45
3     45   12    NaN  NaN    NaN  NaN
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • My data is organized as a list of lists so that `C=[[...],[...],[...]...]` since each nested list has a different length. How could I handle this situation? – tlnagy Jun 18 '14 at 17:12
  • 1
    This implementation is giving me an error because the length of the nested lists within `C` is not equal to length of `A` and `B` – tlnagy Jun 18 '14 at 17:17
  • What does each list represent, rows or columns? Why are they different lengths? Are the shorter lists supposed to be missing certain elements? See edited answer for a guess. – chrisb Jun 18 '14 at 17:18
  • The values in each nested list are the rows and the nested list themselves are the columns. The length of the columns is different because `one` has a different number of start:end pairs than `two` – tlnagy Jun 18 '14 at 17:28
  • I think we're getting tangled on terminology - can you edit your question to provide some data that matches what you're talking about, and then show what output you want? – chrisb Jun 18 '14 at 17:34
  • I added the structure of my `C` to the question. The desired output was what was shown at the top. Thanks for the help! – tlnagy Jun 18 '14 at 17:45
  • I added an extra couple clarifying sentences to the description as well. I'm open to redesigning `C`'s structure, but I'm not aware of a better way to represent the data. – tlnagy Jun 18 '14 at 17:55
  • It looks like it worked! Is there no better way of making a 2d numpy array from arrays of non-identical lengths? – tlnagy Jun 18 '14 at 19:29
  • Woah, I really like this question and answer. 3D Dataframes opens a lot of doors. – O.rka Apr 08 '16 at 18:44
  • In Python3 it seems you need to do ...list(zip(A,B)) – Reed Richards Mar 23 '17 at 23:24
14

As @Aaron mentioned in a comment above, panels have been deprecated. Also, @tlnagy mentioned his dataset would be likely to expand to more than 3 dimensions in the future.

This sounds like a good use-case for the xarray package, which provides semantically labelled arrays of arbitrarily many dimensions. Pandas and xarray have strong conversion support, and panels have been deprecated in favour of using xarray.

Initial setup of the problem.

import numpy as np

A = np.array([[7,11,56,45], [20,21,74,12]]).T
B = np.array([[42], [52]]).T
C = np.array([[90,213,9], [101, 34, 45]]).T

You can then create a three dimensional xarray.DataArray object like so:

import xarray

output_as_dataarray = xarray.concat(
    [
        xarray.DataArray(
            X,
            dims=["record", "edge"],
            coords={"record": range(X.shape[0]), "edge": ["start", "end"]},
        )
        for X in (A, B, C)
    ],
    dim="descriptor",
).assign_coords(descriptor=["A", "B", "C"])

We turn our three 2D numpy arrays into xarray.DataArray objects, and then concatenate them together along a new dimension.

Our output looks like so:

<xarray.DataArray (descriptor: 3, record: 4, edge: 2)>
array([[[  7.,  20.],
        [ 11.,  21.],
        [ 56.,  74.],
        [ 45.,  12.]],

       [[ 42.,  52.],
        [ nan,  nan],
        [ nan,  nan],
        [ nan,  nan]],

       [[ 90., 101.],
        [213.,  34.],
        [  9.,  45.],
        [ nan,  nan]]])
Coordinates:
  * record      (record) int64 0 1 2 3
  * edge        (edge) <U5 'start' 'end'
  * descriptor  (descriptor) <U1 'A' 'B' 'C'
scottclowe
  • 2,015
  • 19
  • 20
4

Can't you just use a panel?

import numpy as np
import pandas as pd

A = ['one', 'two' ,'three']
B = ['start','end']
C = [np.random.randint(10, 99, 2)]*6
df = pd.DataFrame(C,columns=B  )
p={}
for a in A:
    p[a]=df
panel= pd.Panel(p)
print panel['one']
user3684792
  • 2,542
  • 2
  • 18
  • 23
  • 3
    It's likely that my dataset will be higher dimensional in the future. Isn't panel limited to 3 dimensions? – tlnagy Jun 18 '14 at 16:57
  • 25
    Note: Panels are now [deprecated](https://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-panel)! – Aaron N. Brock Apr 02 '18 at 18:48
  • 4
    Updated deprecation reference: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.23.0.html#deprecate-panel – Ethan T Jan 31 '20 at 14:29