I need a very fast and efficient way to do a rolling linear regression. I looked through these two threads:

Efficient way to do a rolling linear regression
Rolling linear regression

From them, I had inferred that numpy was (computationally) the fastest. However, using my (limited) Python skills, I found the time to compute the same set of rolling data was *** the same ***.

Is there a faster way to compute than any of the 3 methods I post below? I would have thought the numpy way would be much faster, but unfortunately, it wasn't.

########## testing time for pd rolling vs numpy rolling

def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]


win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()

########
s_time = time.time()
roll_np = np.empty(0)
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np = np.append(roll_np, grad_)

print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()

#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()
Kiann

2 Answers


tl;dr

My answer is

view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]

And it takes 1.2 ms to compute, compared to 2 seconds for your pandas and numpy versions, and 3.5 seconds for your stats version.

Long version

One method could be to use sliding_window_view to transform your tmp_ array into an array of windows (a fake one: it is just a view, not really a 10000×30 array of data. It is just tmp_, viewed differently; hence the _view in the function name).

No direct advantage by itself. But from there, you can try to take advantage of vectorization.
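A small demo of what sliding_window_view produces (my own sketch, not part of the original answer; the array sizes here are arbitrary):

```python
import numpy as np

# sliding_window_view over a length-10 array with window 4 yields a
# (10 - 4 + 1, 4) array of overlapping windows...
a = np.arange(10)
v = np.lib.stride_tricks.sliding_window_view(a, (4,))
print(v.shape)                  # (7, 4)
print(v[0], v[1])               # [0 1 2 3] [1 2 3 4]

# ...but no data is copied: it is only a view into the same buffer.
print(np.shares_memory(a, v))   # True
```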

I do that in two different ways: an easy one, and one that takes a minute of thinking. Since I put the best answer first, the rest of this message can appear chronologically inconsistent (I say things like "in my previous answer" when the previous answer comes later), but I tried to write both answers consistently.

New answer: matrix operations

One method to do that (since lstsq is one of the rare numpy functions that won't just vectorize naturally) is to go back to what lstsq(X,Y) does in reality: it computes (XᵀX)⁻¹Xᵀ Y
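A quick sanity check of that identity (my own sketch, with an arbitrary random window): the normal-equations formula gives the same coefficients as lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)
win = 30
X = np.vstack([np.arange(win), np.ones(win)]).T  # [x, 1] design matrix
y = rng.random(win)

beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y   # (XᵀX)⁻¹ Xᵀ y

print(np.allclose(beta_lstsq, beta_normal))      # True
```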

So let's just do that. In Python, with xxx being the X array (of arange and ones in your example) and view the array of windows onto your data (that is, view[i] is tmp_[i:i+win_]), that would be np.linalg.inv(xxx.T@xxx)@xxx.T@view[i] for each row i. We could vectorize that operation with np.vectorize to avoid iterating over i, as I did for my first solution (see below). But the thing is, we don't need to. That is just a matrix times a vector. And computing a matrix times a vector, for each vector in an array of vectors, is just matrix multiplication!

Hence my 2nd (and probably final) answer:

view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat=(np.linalg.inv(xxx.T @ xxx) @ (xxx.T) @ view.T)[0]

roll_mat is still identical to roll_np (with one extra row, because your roll_np stops one row short of the last possible window; see below for graphical proof with my first answer. I could provide a new image for this one, but it is indistinguishable from the one I already used). So, same result (unsurprisingly, I should say... but sometimes it is still a surprise when things work exactly as theory says they do).
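The equivalence can be checked numerically. A self-contained sketch (my own; it uses fresh random data in place of tmp_ and a smaller array to keep the lstsq loop quick):

```python
import numpy as np

win_ = 30
tmp_ = np.random.rand(1000)

# matrix version: one slope per window, via (XᵀX)⁻¹ Xᵀ applied to all windows
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx = np.vstack([np.arange(win_), np.ones(win_)]).T
roll_mat = (np.linalg.inv(xxx.T @ xxx) @ xxx.T @ view.T)[0]

# loop version from the question: one lstsq call per window
roll_np = np.array([
    np.linalg.lstsq(xxx, tmp_[i:i + win_], rcond=None)[0][0]
    for i in range(len(tmp_) - win_)
])

# identical, up to the extra last row of the matrix version
print(np.allclose(roll_mat[:-1], roll_np))  # True
```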

But timing is something else. As promised, my previous factor of 4 was nothing compared to what real vectorization can do. See the updated timing table:

Method                              Time
pandas                              2.10 s
numpy roll                          2.03 s
stat                                3.58 s
numpy view/vectorize (see below)    0.46 s
numpy view/matmult                  1.2 ms

The important part is the 'ms', where all the others are in 's'. So, this time, the speed-up factor is about 1700!

Old answer: vectorize

A lame method, once we have this view, could be to use np.vectorize from there. I call it lame because vectorize is not meant to be efficient: it is just a for loop by another name. The official documentation clearly says it is "not to be used for performance". And yet, it would be an improvement over your code:

view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx=np.vstack([np.arange(win_), np.ones(win_)]).T
f = np.vectorize(lambda y: np.linalg.lstsq(xxx,y,rcond=None)[0][0], signature='(n)->()')
roll_vectorize=f(view)

First, let's verify the result:

plt.scatter(f(view)[:-1], roll_np)

(image: scatter plot of f(view)[:-1] against roll_np, falling on the identity line)

So, obviously, same results as roll_np (which, I've checked the same way, are the same results as the other two, with the same variation in indexing, since the 3 methods don't all use the same strategy at the borders).

And the interesting part, the timings:

Method                  Time
pandas                  2.10 s
numpy roll              2.03 s
stat                    3.58 s
numpy view/vectorize    0.46 s

So, you see, it is not supposed to be for performance, and yet I gain more than a 4× factor with it.

I am pretty sure that a more vectorized method (alas, lstsq doesn't directly allow it, unlike most numpy functions) would be even faster.
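One such method (my own sketch, not part of the original answer): for a fixed x = arange(win_), the OLS slope has the closed form (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), and the rolling sums Σy and Σxy can both be computed in one shot with np.convolve.

```python
import numpy as np

win_ = 30
tmp_ = np.random.rand(10000)
x = np.arange(win_)

# rolling sums via convolution: sy[k] = Σ y[k+j], sxy[k] = Σ x[j]·y[k+j]
sy  = np.convolve(tmp_, np.ones(win_), mode='valid')
sxy = np.convolve(tmp_, x[::-1], mode='valid')

# closed-form OLS slope per window (Σx and Σx² are constants here)
slope = (win_ * sxy - x.sum() * sy) / (win_ * (x**2).sum() - x.sum()**2)

# matches the matrix version from the answer above
view = np.lib.stride_tricks.sliding_window_view(tmp_, (win_,))
xxx = np.vstack([x, np.ones(win_)]).T
roll_mat = (np.linalg.inv(xxx.T @ xxx) @ xxx.T @ view.T)[0]
print(np.allclose(slope, roll_mat))  # True
```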

chrslg
  • wow, great answer @chrslg! Thanks for your help... The time-factor difference is *** insane *** – Kiann Jan 12 '23 at 12:35

First, if you need some tips for optimizing your Python code, I believe this playlist might help you.

To make it faster: "append" is never a good way. Think of it in terms of memory: every time you append, Python may create a completely new array with a bigger size (maybe n+1, where n is the old size), copy over the n old items, and then add the new one at the last place.

So I changed it to be as follows:

########## testing time for pd rolling vs numpy rolling

def fitcurve(x_pts):
    poly = np.polyfit(np.arange(len(x_pts)), x_pts, 1)
    return np.poly1d(poly)[1]


win_ = 30
# tmp_ = data_.Close
tmp_ = pd.Series(np.random.rand(10000))
s_time = time.time()
roll_pd = tmp_.rolling(win_).apply(lambda x: fitcurve(x)).to_numpy()
print('pandas rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_pd).plot()

########
s_time = time.time()
roll_np = np.zeros(len(tmp_)-win_) ### Change
for cnt_ in range(len(tmp_)-win_):
    tmp1_ = tmp_[cnt_:cnt_+ win_]
    grad_ = np.linalg.lstsq(np.vstack([np.arange(win_), np.ones(win_)]).T, tmp1_, rcond = None)[0][0]
    roll_np[cnt_] = grad_ ### Change
    # roll_np = np.append(roll_np, grad_) ### Change

print('numpy rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_np).plot()

#################
s_time = time.time()
roll_st = np.empty(0)
from scipy import stats
for cnt_ in range(len(tmp_)-win_):
    slope, intercept, r_value, p_value, std_err = stats.linregress(np.arange(win_), tmp_[cnt_:cnt_ + win_])
    roll_st = np.append(roll_st, slope)
print('stats rolling time is', time.time() - s_time)
plt.show()
pd.Series(roll_st).plot()

I initialized the array in the first place with the size it is expected to end up with (len(tmp_)-win_ in the range), and just assigned values to it later, and it was much faster.
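The cost of growing-by-append versus preallocating can be seen in isolation; a minimal sketch of the idea (my own, with an arbitrary element count):

```python
import time
import numpy as np

n = 20000

# growing with np.append: each call allocates a new array and copies everything
t0 = time.time()
a = np.empty(0)
for i in range(n):
    a = np.append(a, i)
t_append = time.time() - t0

# preallocating once, then writing in place
t0 = time.time()
b = np.zeros(n)
for i in range(n):
    b[i] = i
t_prealloc = time.time() - t0

print(np.array_equal(a, b))    # same contents either way
print(t_append > t_prealloc)   # preallocation wins
```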

There are also some other tips you can use. Python is an interpreted language, meaning each time it takes a line, converts it to machine code, then executes it, and it does that for each line. Meaning if you can do multiple things in one line, they will get converted to machine code at one time, and it should be faster; for example, think of list comprehensions.

Ahmed Adel
  • Note that list comprehensions are faster because there is no need to maintain a partially consistent Python representation while the list is being built, whereas that is needed in a traditional for loop, since the partial list is a Python object that may be accessed by the algorithm. This has nothing to do with the number of code lines. Code is parsed when the Python file is read, not at execution. So, when the first `time.time()` occurs, there is no such thing as a line of code anymore. It is already a "compiled" internal representation in memory. – chrslg Jan 11 '23 at 15:08
  • So, I like concise code too. Especially in the SO context, since it is always satisfying to answer with a one-liner. But trying to fit the code into one line to spare parsing time is a bad reason to do so, and would probably lead to unnecessary obfuscation. It is simply not how an interpreter works. An interpreter does NOT "take a line, convert it to machine code, execute it, and then do the same for the other lines". All of that (reading lines, converting them to executable code; well, not exactly that either, more to an internal representation) is done before the first line of code is executed. – chrslg Jan 11 '23 at 15:11
  • Thanks Ahmed; though I went with @chrslg's answer, I really appreciate your help and effort as well. – Kiann Jan 12 '23 at 12:35