2

I have a speed/readability improvement question. I have N time series of length T in a matrix Y (dim=TxN). I also have a 3D Matrix X which is TxNxK.

The data has some random NaN values.

Given a regression window (W), the goal is to create a prediction of Y using data as of X. Taking into account that for any single Y time series, the regression should be over the last available W variable values. This means you need all the X variables and the corresponding series Y variables but you don't care about the other Y variables.

I'm able to do that with the code below, but I feel there might be a way to remove the loops. I tried using map and functions but I get similar timeit values and less readability.

import random
import numpy as np
from numpy.linalg import inv

# Parameters
N = 500  #Number of time series
T = 1000 #Length of each time series
W = 72   #Regression window
K = 3    #Numer of independent variables

Y = np.random.randn(T, N)
X = np.random.randn(T, N, K)

# Add the constants
X = np.concatenate((X, np.ones((T, N, 1))), axis=2)

def get_rand_arr(arr, frac_rand=0.0001):
    ix = [(row, col) for row in range(arr.shape[0]) for col in range(arr.shape[1])]
    for row, col in random.sample(ix, int(round(frac_rand*len(ix)))):
        arr[row, col] = np.nan
    return arr

# Insert some NaN values - like the real world - I dont care about this loop
Y = get_rand_arr(Y)
for i in range(X.shape[2]):
    X[:, :, i] = get_rand_arr(X[:, :, i])

X_mask = np.apply_along_axis(np.any, 1, np.apply_along_axis(np.any, 2, np.isnan(X)))
Y_mask = np.concatenate([np.logical_or(np.isnan(Y)[:, i],X_mask).reshape(-1,1) for i in range(N)],axis=1)

Y_hat = np.NaN*np.zeros((T, N))
for j in range(N):
    y = Y[~Y_mask[:, j], j]
    x = X[~Y_mask[:, j], j, :]
    y_hat = np.NaN*np.zeros(y.shape[0])
    for i in range(y_hat.shape[0]-W):
        y_hat[i+W] = x[i+W, :].dot(inv(x[i:i+W, :].T.dot(x[i:i+W, :])).dot(x[i:i+W, :].T.dot(y[i:i+W])))
    Y_hat[~Y_mask[:, j], j] =  y_hat

I get the following timeit results

%%timeit
Y_hat = np.NaN*np.zeros((T, N))
for j in range(N):
    y = Y[~Y_mask[:, j], j]
    x = X[~Y_mask[:, j], j, :]
    y_hat = np.NaN*np.zeros(y.shape[0])
    for i in range(y_hat.shape[0]-W):
        y_hat[i+W] = x[i+W, :].dot(inv(x[i:i+W, :].T.dot(x[i:i+W, :])).dot(x[i:i+W, :].T.dot(y[i:i+W])))
    Y_hat[~Y_mask[:, j], j] =  y_hat
9.5 s ± 373 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The time series is long enough and the regression window is small enough that I don't realistically need to worry about making sure I have enough values to run at least 1 regression.

Sahil Puri
  • 491
  • 3
  • 12
  • 2
    If you have to iterate, use `nditer`. However, if you put your logic into a function, you can use `apply_along_axis` or `vectorize` to eliminate the for loops. The docs are here: https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html and https://docs.scipy.org/doc/numpy/reference/generated/numpy.apply_along_axis.html and https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html – Evan Rosica Jun 09 '19 at 12:11
  • 1
    Thanks, @EvanRosica! Will read and give this another stab. – Sahil Puri Jun 09 '19 at 12:28
  • 2
    You're welcome. This may also be of interest: "Most efficient way to map function over numpy array": https://stackoverflow.com/questions/35215161/most-efficient-way-to-map-function-over-numpy-array – Evan Rosica Jun 09 '19 at 12:31
  • 1
    This is also very useful. The only issue I see is that my Y-hat values are coming from different slices of different numpy arrays. Most of the examples work well in case of dealing with a single numpy object. – Sahil Puri Jun 09 '19 at 12:38
  • Perhaps you could concatenate the relevant slices into a single array, then use some of the methods? – Evan Rosica Jun 09 '19 at 12:41
  • Don't use `nditer`! It isn't faster (in Python code), and generally not easier to use. Also stay away from `apply_along_axis` if you are interested in speed. `np.random` lets you generate many random values at once, without iteration. Another quick observation, `np.inv` can work with multiple square arrays; `matmul` might also help in that expression. – hpaulj Jun 09 '19 at 15:34
  • `np.vectorize` does not help with speed. – hpaulj Jun 09 '19 at 15:46
  • np.any takes an axis parameter, integer or tuple – hpaulj Jun 09 '19 at 19:16

0 Answers0