Looking for an efficient way to reorganize a pandas Dataframe

Question

I have data in a Dataframe structured like so...

       D           K         Factor        p
0   -0.483128   -1.240024  -1.214765   -1.002418
1   -0.692334   -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304
3   -1.691841   -1.632132   1.562630    0.997304
4   -2.084926   -1.632132   1.562630    0.997304

I'm trying to reorganize the data into a new structure where each row contains 'period' number of rows from the existing data. Moves ahead one row in existing and the stacks the next 'period lines'.

the function so far:

def prepData(seq, period):
    newStacks = pd.DataFrame()
    for pos in range(0, len(seq) - (period+1), 1):
        chunk = (seq[pos:pos + period])
        stack = []
        for sliver in range(0, len(chunk), 1):
            piece = (chunk.iloc[sliver:])
            print(piece)
            stack.append(piece) 
        
        newStacks.append(chunk)

return newStacks

this obviously is not efficient and aside doesn't produce the desire structure. The aim is to get a structure like below considering period = 3

0   -0.483128   -1.240024   -1.214765   -1.002418   -0.692334   -1.632132   1.562630    0.997304 -1.189383  -1.632132   1.562630    0.997304
1   -0.692334   -1.632132   1.562630    0.997304    -1.189383   -1.632132   1.562630    0.997304 -1.691841  -1.632132   1.562630    0.997304
2   -1.189383   -1.632132   1.562630    0.997304    -1.691841   -1.632132   1.562630    0.997304 -2.084926  -1.632132   1.562630    0.997304

AA Simple way to accomplish this would be appreciated.

You want the result to be a dataframe or a numpy array or a list? — Joe Ferndz, Apr 07 '21 at 05:24

Joe Ferndz · Answer 1 · 2021-04-07T06:10:54.900

I am not sure if you want to create a dataframe or a list. Here's the code to get you both.

import pandas as pd
c = ['D','K','Factor','p']
d = [[-0.483128,   -1.240024,  -1.214765,   -1.002418],
[-0.692334,   -1.632132,   1.562630,    0.997304],
[-1.189383,   -1.632132,   1.562630,    0.997304],
[-1.691841,   -1.632132,   1.562630,    0.997304],
[-2.084926,   -1.632132,   1.562630,    0.997304]]
df = pd.DataFrame(d,columns=c)
print (df)

p = 3 #this is the period you wanted. I set it to 3
stack_list = []   #this will store the final stacked list
                  #note: don't use stack, its used by pandas to stack

for i in range(len(df)-p+1):  #iterating thru the dataframe
    # convert p rows to a list after you stack them
    chunk = df.loc[i:i+p-1].stack().reset_index(level=1,drop=True).tolist()

    stack_list.append(chunk)   #store chunk to stack_list


df1 = pd.DataFrame(stack_list)   #creating a dataframe as per your request

#printing both stack_list and dataframe

print (stack_list)
print (df1)

The output of this will be:

The original dataframe is:

          D         K    Factor         p
0 -0.483128 -1.240024 -1.214765 -1.002418
1 -0.692334 -1.632132  1.562630  0.997304
2 -1.189383 -1.632132  1.562630  0.997304
3 -1.691841 -1.632132  1.562630  0.997304
4 -2.084926 -1.632132  1.562630  0.997304

The stacked list is:

[[-0.483128, -1.240024, -1.214765, -1.002418, -0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304], 
 [-0.692334, -1.632132, 1.56263, 0.997304, -1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304], 
 [-1.189383, -1.632132, 1.56263, 0.997304, -1.691841, -1.632132, 1.56263, 0.997304, -2.084926, -1.632132, 1.56263, 0.997304]]

The new dataframe that you wanted to get created is:

         0         1         2   ...        9        10        11
0 -0.483128 -1.240024 -1.214765  ... -1.632132  1.56263  0.997304
1 -0.692334 -1.632132  1.562630  ... -1.632132  1.56263  0.997304
2 -1.189383 -1.632132  1.562630  ... -1.632132  1.56263  0.997304

Thanks for this, Joe. I upvoted your answers as it is a good solution. I chose jezrael's solution for my implementation. — Tim, Apr 07 '21 at 23:43

score 2 · Accepted Answer · answered Apr 07 '21 at 06:17

If there are all floats columns for improve performance use strides in numpy with reshape 3d array to 2d and pass to DataFrame constructor:

#https://stackoverflow.com/a/44306231/2901002 a bit changed
def strided_lastaxis(a, L):
    s0,s1 = a.strides
    m,n = a.shape
    return np.lib.stride_tricks.as_strided(a, shape=(m-L+1,L,n), strides=(s0,s0,s1))

a = strided_lastaxis(df.to_numpy(), 3)

df1 = pd.DataFrame(a.reshape(a.shape[0], -1))
print (df1)
         0         1         2         3         4         5        6   \
0 -0.483128 -1.240024 -1.214765 -1.002418 -0.692334 -1.632132  1.56263   
1 -0.692334 -1.632132  1.562630  0.997304 -1.189383 -1.632132  1.56263   
2 -1.189383 -1.632132  1.562630  0.997304 -1.691841 -1.632132  1.56263   

         7         8         9        10        11  
0  0.997304 -1.189383 -1.632132  1.56263  0.997304  
1  0.997304 -1.691841 -1.632132  1.56263  0.997304  
2  0.997304 -2.084926 -1.632132  1.56263  0.997304

you continue to amaze me with your deep knowledge in pandas/numpy/python. Thx for sharing this. Didn't know about strides. — Joe Ferndz, Apr 07 '21 at 06:22
@JoeFerndz - Honestly numpy is still hard to me, still learning. Pandas is better for me, because I more coding with it, it means more experience. — jezrael, Apr 07 '21 at 06:24

Looking for an efficient way to reorganize a pandas Dataframe

2 Answers2