
This sounds similar to my last question, but it is a different one:

I can create incrementally growing samples from a DataFrame like this:

import numpy as np

# df = { take an average float dataframe of 0.5-1 mio rows & 20-50 cols ... }
arr = np.asarray(df)

# sample i holds the first i rows of the original data
res = list(map(lambda i: arr[:i], range(1, df.shape[0] + 1)))

print(res)
>>>[  
    [                                                                                
    ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71  ]  
                                                                                       ],
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89,   12087.71,      12087.71   ],  
   ["2019-06-17 08:46:00",     12087.91,      np.nan,    12087.71,      12087.91   ]  
                                                                                        ], 
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71   ], 
   ["2019-06-17 08:46:00" ,    12087.91  ,     np.nan,    12087.71,      12087.91   ],   
   ["2019-06-17 08:47:00" ,    12088.21  ,   12088.21,    12084.21   ,   12085.21   ]   
                                                                                        ], 
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89 ,   12087.71 ,     12087.71   ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,     np.nan,     12087.71,      12087.91   ], 
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21  ,  12084.21  ,    12085.21   ],    
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21  ,  12084.91  ,    12089.41   ] 
                                                                                        ], 
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89 ,   12087.71  ,    12087.71    ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,    np.nan,      12087.71,      12087.91    ], 
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21  ,  12084.21   ,   12085.21    ],   
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21  ,  12084.91   ,   12089.41    ],  
   ["2019-06-17 08:49:00" ,    12089.71 ,    12090.21  ,  12087.21   ,   12088.21    ]   
                                                                                        ]
                                                                                                 ]

But they aren't equally shaped (intentionally), so I want to pad them with np.nan-rows.

Important: the np.nan-rows can go anywhere in a sample, as long as they don't destroy the original rows. They may sit between the original rows at random positions, but the original rows themselves must stay unchanged.

TL;DR: I need to keep the order of the original data rows and leave their values untouched, but otherwise fill each sample with np.nan-rows until all samples are the same length (that of the longest sample), no matter where the padding ends up. How can I do this in a time-efficient manner?
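To make the requirement concrete, here is a naive and probably slow sketch of what I mean (just a reference, not the solution I'm after; `res` is the list from above and the padding happens to land on top):

import numpy as np

max_len = max(len(sample) for sample in res)   # rows in the longest sample

new_res = [
    np.vstack([np.full((max_len - len(sample), sample.shape[1]), np.nan, dtype=object),
               sample])
    for sample in res
]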

The ideal result looks like this (further below you can see another possible outcome, with random np.nan-row positioning):

print(new_res)
>>>
[  
  [
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],            
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],                                                                    
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71  ]  
                                                                                       ],
  [
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:45:00",     12089.89,     12089.89,   12087.71,      12087.71   ],  
   ["2019-06-17 08:46:00",     12087.91,      np.nan,    12087.71,      12087.91   ]  
                                                                                        ], 
  [ 
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71   ], 
   ["2019-06-17 08:46:00" ,    12087.91  ,     np.nan,    12087.71,      12087.91   ],   
   ["2019-06-17 08:47:00" ,    12088.21  ,   12088.21,    12084.21   ,   12085.21   ]   
                                                                                        ], 
  [
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71 ,     12087.71   ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,     np.nan,     12087.71,      12087.91   ], 
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21,    12084.21  ,    12085.21   ],    
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21,    12084.91  ,    12089.41   ] 
                                                                                        ], 
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89 ,   12087.71  ,    12087.71    ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,    np.nan,      12087.71,      12087.91    ], 
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21  ,  12084.21   ,   12085.21    ],   
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21  ,  12084.91   ,   12089.41    ],  
   ["2019-06-17 08:49:00" ,    12089.71 ,    12090.21  ,  12087.21   ,   12088.21    ]   
                                                                                        ]
                                                                                                 ]

A possible outcome with randomly placed np.nan-rows:

print(new_res)
>>>
[  
  [
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],                                                                      
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71  ] 
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ] 
                                                                                       ],
  [
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:45:00",     12089.89,     12089.89,   12087.71,      12087.71   ],   
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:46:00",     12087.91,      np.nan,    12087.71,      12087.91   ]  
                                                                                        ], 
  [  
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71,      12087.71   ], 
   ["2019-06-17 08:46:00" ,    12087.91  ,     np.nan,    12087.71,      12087.91   ],   
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:47:00" ,    12088.21  ,   12088.21,    12084.21   ,   12085.21   ]   
                                                                                        ], 
  [ 
   ["2019-06-17 08:45:00",     12089.89,     12089.89,    12087.71 ,     12087.71   ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,     np.nan,     12087.71,      12087.91   ],  
   [                np.nan,       np.nan,       np.nan,      np.nan,       np.nan  ],  
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21,    12084.21  ,    12085.21   ],    
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21,    12084.91  ,    12089.41   ] 
                                                                                        ], 
  [
   ["2019-06-17 08:45:00",     12089.89,     12089.89 ,   12087.71  ,    12087.71    ],    
   ["2019-06-17 08:46:00" ,    12087.91 ,    np.nan,      12087.71,      12087.91    ], 
   ["2019-06-17 08:47:00" ,    12088.21 ,    12088.21  ,  12084.21   ,   12085.21    ],   
   ["2019-06-17 08:48:00" ,    12085.09 ,    12090.21  ,  12084.91   ,   12089.41    ],  
   ["2019-06-17 08:49:00" ,    12089.71 ,    12090.21  ,  12087.21   ,   12088.21    ]   
                                                                                        ]
                                                                                                 ]
La-Li-Lu-Le-Low

1 Answer


I think that this might work for you:

arr = np.array(df)
n = arr.shape[0]

# lower-triangular index pairs: for every sample i, all row indices j <= i
ind1, ind2 = np.tril_indices(n)

# start from an all-nan (n, n, cols) cube, then copy row j of the data
# into position j of sample i for every pair (i, j) with j <= i
result = np.full((n, n, arr.shape[1]), np.nan, dtype=object)
result[ind1, ind2, :] = arr[ind2, :]

This gives:

result = 
[[['2019-06-17 08:45:00' 12089.89 12089.89 12087.71 12087.71]
  [nan nan nan nan nan]
  [nan nan nan nan nan]
  [nan nan nan nan nan]
  [nan nan nan nan nan]]

 [['2019-06-17 08:45:00' 12089.89 12089.89 12087.71 12087.71]
  ['2019-06-17 08:46:00' 12087.91 nan      12087.71 12087.91]
  [nan nan nan nan nan]
  [nan nan nan nan nan]
  [nan nan nan nan nan]]

 [['2019-06-17 08:45:00' 12089.89 12089.89 12087.71 12087.71]
  ['2019-06-17 08:46:00' 12087.91 nan      12087.71 12087.91]
  ['2019-06-17 08:47:00' 12088.21 12088.21 12084.21 12085.21]
  [nan nan nan nan nan]
  [nan nan nan nan nan]]

 [['2019-06-17 08:45:00' 12089.89 12089.89 12087.71 12087.71]
  ['2019-06-17 08:46:00' 12087.91 nan      12087.71 12087.91]
  ['2019-06-17 08:47:00' 12088.21 12088.21 12084.21 12085.21]
  ['2019-06-17 08:48:00' 12085.09 12090.21 12084.91 12089.41]
  [nan nan nan nan nan]]

 [['2019-06-17 08:45:00' 12089.89 12089.89 12087.71 12087.71]
  ['2019-06-17 08:46:00' 12087.91 nan      12087.71 12087.91]
  ['2019-06-17 08:47:00' 12088.21 12088.21 12084.21 12085.21]
  ['2019-06-17 08:48:00' 12085.09 12090.21 12084.91 12089.41]
  ['2019-06-17 08:49:00' 12089.71 12090.21 12087.21 12088.21]]]

It's all numpy-based, so it should be pretty efficient, but I haven't benchmarked that claim. Also, the result is a 3-dimensional numpy array and not a list of matrices, but going from numpy to a list is easy.
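For example, to get back a plain list of 2-D samples (a trivial conversion, shown only for completeness):

new_res = list(result)            # list of n 2-D numpy arrays, one per sample
# or, for nested Python lists: new_res = result.tolist()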

If you want to have your nans "above" the data, as shown in your example, you can use `result[ind1, n-ind1-1+ind2, :] = arr[ind2, :]` instead of `result[ind1, ind2, :] = arr[ind2, :]`.
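Put together, the "nans above the data" variant only differs in that one assignment (this is just the snippet from above with the modified index):

arr = np.array(df)
n = arr.shape[0]
ind1, ind2 = np.tril_indices(n)

result = np.full((n, n, arr.shape[1]), np.nan, dtype=object)
# row j of the data ends up at position n-i-1+j of sample i,
# i.e. the valid rows sit at the bottom and the nan-rows above them
result[ind1, n - ind1 - 1 + ind2, :] = arr[ind2, :]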

EDIT: some performance tuning

The first obvious optimization is to use native numpy dtypes by getting rid of the first (date) column:

arr = np.array(df.loc[:,1:])

Rewriting the previous implementation as a function:

def process_the_data(arr):
    n = arr.shape[0]
    ind1, ind2 = np.tril_indices(n)
    # dtype is now the native float dtype of arr instead of object
    result = np.full((n, n, arr.shape[1]), np.nan, dtype=arr.dtype)
    result[ind1, ind2, :] = arr[ind2, :]
    return result

yields a ~2x speed improvement.

In order to use numba and get a faster conversion, it is better to rewrite the function using explicit for loops. Using parallel=True (notice the use of prange in the outer loop) gives a small additional boost if the dataset is large enough, but is slower for small datasets.

from numba import njit, prange

@njit(parallel=True)
def process_the_data_jit(arr):
    n, m = arr.shape
    result = np.empty((n, n, m), dtype=arr.dtype)

    # each sample i gets the first i+1 rows of arr, the remaining rows are nan
    for i in prange(n):
        for j in range(i + 1):
            for k in range(m):
                result[i, j, k] = arr[j, k]
        result[i, i + 1:, :] = np.nan

    return result

The work here is memory-bound, so if you don't need the full 64-bit precision, using float32 speeds things up by a further ~1.8x.

In summary, with a df containing 2500 rows:

arr_original = np.array(df)             # object dtype, date column included
arr          = np.array(df.loc[:, 1:])  # float64, numerical columns only

process_the_data(arr_original)          # the previous result
process_the_data(arr)
process_the_data_jit(arr)
process_the_data_jit(arr.astype(np.float32))

take, respectively (on my machine):

1.4 s
660 ms
80 ms
50 ms

So it's a nice 17x or 28x speedup, depending on the dtype.
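For reference, a rough sketch of how such timings could be reproduced; the DataFrame below is a synthetic stand-in (2500 rows, a date column plus 4 float columns), and the exact numbers will of course vary by machine:

import time
import numpy as np
import pandas as pd

dates = pd.date_range("2019-06-17 08:45", periods=2500, freq="min").astype(str)
vals  = np.random.rand(2500, 4) * 10 + 12080
df    = pd.DataFrame({0: dates, 1: vals[:, 0], 2: vals[:, 1], 3: vals[:, 2], 4: vals[:, 3]})

arr_original = np.array(df)             # object dtype, date column included
arr          = np.array(df.loc[:, 1:])  # float64, numerical columns only

cases = [("object dtype (with dates)", process_the_data,     arr_original),
         ("float64                  ", process_the_data,     arr),
         ("float64 + numba          ", process_the_data_jit, arr),
         ("float32 + numba          ", process_the_data_jit, arr.astype(np.float32))]

for label, f, a in cases:
    f(a)                                # warm-up run (also triggers numba compilation)
    t0 = time.perf_counter()
    f(a)
    print(label, f"{time.perf_counter() - t0:.3f} s")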

Miguel
  • is there a way to speed it up with numba? (sorry for the late answer, I needed rehab from all the stress) – La-Li-Lu-Le-Low Oct 04 '20 at 15:29
  • Maybe, but I don't know how well numba plays with arrays of type "object". If you really need speed I would recommend converting the date+time into a numerical value and using an ndarray of type np.float64, or splitting the array in 2, one part with the dates, the other with the numerical values – Miguel Oct 05 '20 at 11:41
  • the time values were just to visualize it. Sorry if I created a misunderstanding, they aren't used in the array. -> therefore we can work with `float64`. I tried @njit but I sadly get compilation errors. :( Could you give it a try? – La-Li-Lu-Le-Low Oct 05 '20 at 13:53
  • Edited to add some performance optimisations leading to a ~20x speedup – Miguel Oct 06 '20 at 00:09
  • I must tell you however that I feel this entire computation should be avoided: half of the time is spent writing nans and the rest is copying the same data over and over again, so unless you really need to have all of these in memory simultaneously I would strongly recommend looking at an alternative formulation of your problem – Miguel Oct 06 '20 at 00:12
  • the initial problem is that I need the last X values of every subarray, starting at the Xth row. Therefore I justify it with Divakar's justification function (to squeeze the nans up in every subarray and make sure the valid values are at the bottom). But for Divakar's function it sadly has to be a numpy array... and numpy matrices sadly need to have the same shape in every subarray :( I asked a long time ago, but the only thing I found was Divakar's justify_nd() function – La-Li-Lu-Le-Low Oct 06 '20 at 16:19
  • Could you maybe link/post the original problem? (I really feel like this computation is just wasting time) – Miguel Oct 07 '20 at 12:02
  • Didn't find it, but it's quite the same question as this one. The user Divakar replied that I could check out his answer at a similar post: https://stackoverflow.com/questions/44558215/python-justifying-numpy-array He coded a function called "justify_nd()" for this case, which I am thankful for. Is there a way to do the same with an `array of 2d np.float64 matrices`? (when I try this with an array of type object the function won't work) – La-Li-Lu-Le-Low Oct 07 '20 at 20:40
  • I'm still not sure I understand the problem. If you need the last X rows, why not just justify the total array once and then use a view of the last X rows for your future computations? – Miguel Oct 12 '20 at 22:44
  • doesn't work since it has to be an "array", but it's an array of objects and each "object" (because of the different lengths) is a numpy float32-array – La-Li-Lu-Le-Low Oct 29 '20 at 14:56
  • I'm sorry, I don't understand what you mean. What has to be an array, and what do you mean by "different length"? (the dataframes all have 4 entries in your example) – Miguel Oct 31 '20 at 12:08