Short version:

I'm trying to efficiently create an array like x:

input = [0, 1, 2, 3, 4, 5, 6]
x = [[0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6]]

I've tried a simple for loop and it takes too long for the real use case.
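Concretely, the toy transformation above is just a length-3 sliding window; a minimal sketch of what the slow loop computes (variable names are mine):

```python
# Build every length-n window of the toy input with a plain comprehension.
data = [0, 1, 2, 3, 4, 5, 6]
n = 3
x = [data[i:i + n] for i in range(len(data) - n + 1)]
print(x)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```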
Long version:

(extends the short version)

I've got a 400k-row dataframe, which I need to partition into arrays of the next n elements, starting from each row in turn. Currently I build the groups exactly as shown below in the process_data function.

A simple for-based iteration takes forever here (2.5 min on my hardware, to be specific). I've searched the itertools and pandas documentation, and searched here too, but couldn't find a fitting solution.

My current, extremely time-consuming implementation:
import numpy as np

class ModelInputParsing(object):
    def __init__(self, data):
        self.parsed_dataframe = data.fillna(0)

    def process_data(self, lb=50):
        self.X, self.Y = [], []
        for i in range(len(self.parsed_dataframe) - lb):
            self.X.append(self.parsed_dataframe.iloc[i:(i + lb), -2])
            self.Y.append(self.parsed_dataframe.iloc[(i + lb), -1])
        return (np.array(self.X), np.array(self.Y))
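For comparison, the same windows can be produced without a Python-level loop; a sketch using np.lib.stride_tricks.sliding_window_view (this assumes NumPy ≥ 1.20 is available, and uses a small stand-in array rather than my real data):

```python
import numpy as np

# Stand-in for one numeric column; the real column has ~400k values.
values = np.arange(7, dtype=np.float32)
lb = 3

# Each row of `windows` is a view of lb consecutive elements,
# shifted one element per row -- no copying, no Python loop.
windows = np.lib.stride_tricks.sliding_window_view(values, lb)
print(windows.shape)  # (5, 3)
```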
The input data looks like this (where Bid is the mentioned input):
Bid Changes Expected
0 1.20102 NaN 0.000000
1 1.20102 0.000000 0.000000
2 1.20102 0.000000 0.000042
3 1.20102 0.000000 0.000017
4 1.20102 0.000000 0.000025
5 1.20102 0.000000 0.000025
6 1.20102 0.000000 0.000100
...
And the output should look like this:
array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
8.34465027e-06, -8.34465027e-06, 0.00000000e+00],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
-8.34465027e-06, 0.00000000e+00, 3.33786011e-05],
[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 3.33786011e-05, 0.00000000e+00],
...,
[ 0.00000000e+00, 8.34465027e-06, 1.66893005e-05, ...,
-8.34465027e-06, 0.00000000e+00, 0.00000000e+00],
[ 8.34465027e-06, 1.66893005e-05, -8.34465027e-06, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[ 1.66893005e-05, -8.34465027e-06, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 1.66893005e-05]], dtype=float32)
len(x)
399950
Below I've presented x[0] and x[1]. The key here is how the values shift one position back in the next array. For example, the first non-zero value moves from position 7 to position 6 (0-based).
The first element:
x[0]
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -4.16040421e-05, 2.49147415e-05,
-8.34465027e-06, 0.00000000e+00, -7.49230385e-05,
...,
2.50339508e-05, -8.34465027e-06, 3.33786011e-05,
-2.50339508e-05, -8.34465027e-06, 8.34465027e-06,
-8.34465027e-06, 0.00000000e+00], dtype=float32)
len(x[0])
50
The second element:
x[1]
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
-4.16040421e-05, 2.49147415e-05, -8.34465027e-06,
0.00000000e+00, -7.49230385e-05, -1.58131123e-04,
....,
-8.34465027e-06, 3.33786011e-05, -2.50339508e-05,
-8.34465027e-06, 8.34465027e-06, -8.34465027e-06,
0.00000000e+00, 3.33786011e-05], dtype=float32)
len(x[1])
50
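The shift-by-one relationship between consecutive windows can be stated as a direct check (toy sketch with made-up values, not the real data):

```python
import numpy as np

# Consecutive windows overlap in all but one element:
# x[i+1] drops the first element of x[i] and appends one new value.
x0 = np.array([0.0, 1.0, 2.0], dtype=np.float32)
x1 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
assert np.array_equal(x0[1:], x1[:-1])
```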
I'm curious whether there is a way to do this more efficiently, as I'm soon planning to parse datasets of 20M+ rows.