I have a 1D array of independent variable values (x_array
) that match the timesteps in a 3D numpy array of spatial data with multiple time-steps (y_array
). My actual data is much larger: 300+ timesteps and up to 3000 * 3000 pixels:
import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
[-0.3, -0.2, -0.3],
[-0.3, -0.4, -0.4]],
[[-0.2, -0.2, -0.4],
[-0.3, np.nan, -0.3],
[-0.3, -0.3, -0.4]],
[[np.nan, np.nan, -0.3],
[-0.2, -0.3, -0.7],
[-0.3, -0.3, -0.3]],
[[-0.1, -0.3, np.nan],
[-0.2, -0.3, np.nan],
[-0.1, np.nan, np.nan]]])
I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each xy
pixel in y_array
, with values for each timestep in x_array
as my independent variable.
I can reshape to get the data in a form to input it into np.polyfit
which is vectorised and fast:
# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)
# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)
However, this ignores pixels that contain any NaN
values (np.polyfit
does not support NaN
values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).
The answer here uses scipy.stats import linregress
which does calculate the statistics I need, and suggests avoiding NaN
issues by masking out these NaN
values. However, this example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case where each column in y_array_reshaped
will have a different set of NaN
values.
My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many NaN
values in a reasonably fast, vectorised way?
Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in y_array
, even if that pixel contains NaN
values at some point in the time series