0

Consider the following code to create a dummy dataset

import numpy as np
from scipy.stats import norm
import pandas as pd

np.random.seed(10)

n=3

space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)

values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])

### Output  
array([[267.39784458, 300.81493866, 229.19163206],
       [236.1940266 , 266.49469945, 204.01294305],
       [122.55912977, 140.00957047, 106.28339745]])

I can put these data in a pandas dataframe using

space_names = ['A','B','C']
time_names = [2000,2001,2002]

df = pd.DataFrame(values, index=space_names, columns=time_names)
df  

### Output

    2000        2001        2002
A   267.397845  300.814939  229.191632
B   236.194027  266.494699  204.012943
C   122.559130  140.009570  106.283397

This is considered a wide dataset, where each observation lies in a table with 2 variable that acts as coordinates to identify it.
To make it a long-tidy dataset we can suse the .stack method of pandas dataframe

df.columns.name = 'time'
df.index.name = 'space'

df.stack().rename('value').reset_index()  
### Output  
    space   time    value
0   A      2000    267.397845
1   A      2001    300.814939
2   A      2002    229.191632
3   B      2000    236.194027
4   B      2001    266.494699
5   B      2002    204.012943
6   C      2000    122.559130
7   C      2001    140.009570
8   C      2002    106.283397

My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observation for each space-time couple

s = 3
t = 4
r = 2

space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)

values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values

### Output 
array([[[286.50322099, 288.51266345],
        [176.64303485, 175.38175877],
        [136.01675917, 134.44328617]],

       [[187.07608546, 185.4068411 ],
        [112.86398438, 111.983463  ],
        [ 85.99035255,  86.67236986]],

       [[267.66833894, 269.45295404],
        [162.30044715, 162.50564386],
        [124.6374401 , 126.2315447 ]]])

How can I obtain the same structure for the dataframe as above?

Ugly solution

Personally i don't like this solution, and i think one might do it in a more elegant and pythonic way, but still might be useful for someone else so I will post my solution.

labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition

def flatten3d(k):
    return [i for l in k for s in l for i in s]

value_series = pd.Series(flatten3d(values)).rename('y')

split_labels= [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])

pd.concat([df, value_series], axis=1)

### Output  
    s   t   r   y
0   0   0   0   266.2408815208753
1   0   0   1   266.13662442609433
2   0   1   0   299.53178992512954
3   0   1   1   300.13941632567605
4   0   2   0   229.39037800681405
5   0   2   1   227.22227496248507
6   0   3   0   281.76357915411995
7   0   3   1   280.9639352062619
8   1   0   0   235.8137644198259
9   1   0   1   234.23202459516452
10  1   1   0   265.19681013560034
11  1   1   1   266.5462102589883
12  1   2   0   200.730100791878
13  1   2   1   199.83217739700535
14  1   3   0   246.54018839875374
15  1   3   1   248.5496308586532
16  2   0   0   124.90916276929234
17  2   0   1   123.64788669199066
18  2   1   0   139.65391860786775
19  2   1   1   138.08044561039517
20  2   2   0   106.45276370157518
21  2   2   1   104.78351933651582
22  2   3   0   129.86043618610572
23  2   3   1   128.97991481257253
StackyGuy
  • 37
  • 7
  • https://stackoverflow.com/questions/36235180/efficiently-creating-a-pandas-dataframe-from-a-numpy-3d-array – strawdog Nov 24 '22 at 15:22
  • My example is a bit different. As you can see the author of that question started with an array of total length greater than the one of the output dataframe. In my case it should be even easier since the length is already the same as the output dataframe (`n*n*rep`). My problem is that i don't want to lose the sorting and i'd like to automatically assign space and time columns like pd.stack does – StackyGuy Nov 24 '22 at 15:43

1 Answers1

2

This does not use stack, but maybe it is acceptable for your problem:

import numpy as np
import pandas as pd

values = np.arange(18).reshape(3, 3, 2) # Your values here
index  = pd.MultiIndex.from_product([space_names, space_names, time_names], names=["space1", "space2", "time"])

df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()

# df:
#    space1 space2  time  value
# 0       A      A  2000      0
# 1       A      A  2001      1
# 2       A      B  2000      2
# 3       A      B  2001      3
# 4       A      C  2000      4
# 5       A      C  2001      5
# 6       B      A  2000      6
# 7       B      A  2001      7
# 8       B      B  2000      8
# 9       B      B  2001      9
# 10      B      C  2000     10
# 11      B      C  2001     11
# 12      C      A  2000     12
# 13      C      A  2001     13
# 14      C      B  2000     14
# 15      C      B  2001     15
# 16      C      C  2000     16
# 17      C      C  2001     17
Chrysophylaxs
  • 5,818
  • 3
  • 10
  • 21
  • Perfect example of MultiIndex in pandas, thanks i did not think of that. Also np.array.ravel() is a method very useful that i did not know of. – StackyGuy Nov 24 '22 at 16:22