I am trying to perform multivariate linear regression on array data that is larger than memory. I am wondering how I should iterate a dask_ml linear regression function on a multidimensional dask array.
On small enough data, I can use sklearn.linear_model.LinearRegression or sklearn.linear_model.Ridge (with alpha=0.0), as these functions can take a multidimensional y with shape (n_samples, n_targets). The problem can be looked at as performing linear regression n_targets times.
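For example, here is a minimal in-memory sketch of that equivalence (the shapes and variable names are just for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_small = rng.random((100, 3))
y_small = rng.random((100, 4))

# one fit with a 2-D y gives coef_ of shape (n_targets, n_features)
multi = LinearRegression().fit(X_small, y_small)

# and each row of that coef_ matches a separate fit on one target column
single = LinearRegression().fit(X_small, y_small[:, 0])
assert np.allclose(multi.coef_[0], single.coef_)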
Specifically, I am looking at using dask_ml.linear_model.LinearRegression (but I am open to suggestions for alternatives). However, this function only takes a 1-dimensional y. I could consider using a for-loop over the targets, but this seems like a very slow and inefficient approach. What is a better way of doing this?
As a bonus question: I observe that the .coef_ output of dask_ml.linear_model.LinearRegression is a numpy array, which implies that it is eagerly computed. Is there a reason it is not returned as a lazy, computable dask array?
import dask.array as da
n_samples = 1024
n_features = 20
n_targets = 50 # this number is much larger in real life, around 1e6 to 1e8
# generate some random data
X = da.random.random((n_samples, n_features))
y = da.random.random((n_samples, n_targets))
# "regular" non-dask way of doing it, will result in MemoryError for large data
from sklearn.linear_model import LinearRegression
LR1 = LinearRegression()
LR1.fit(X, y)
LR1.coef_ # intended result, with shape (n_targets, n_features)
# very slow attempt at a dask version: A) the for-loop is slow, B) the .coef_ output is a numpy array
from dask_ml.linear_model import LinearRegression
LR2 = LinearRegression(C=999999)  # setting the regularizer 1/C to (approximately) zero
coef_ = []
for i in range(n_targets):
    c = LR2.fit(X, y[:, i]).coef_  # fit one target at a time
    coef_.append(c)
coef_ = da.asarray(coef_)
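For completeness, the kind of batched result I have in mind could (I assume) be written with the ordinary normal equations directly on dask arrays. This is only a rough sketch for small n_features, since explicitly inverting the Gram matrix is numerically cruder than what sklearn does internally, and I have not tested it at scale:

import numpy as np

# sketch: all n_targets regressions at once via the normal equations,
# beta = (X^T X)^{-1} X^T Y, with coef_ being beta transposed
XtX = (X.T @ X).compute()                    # small (n_features, n_features) matrix
XtX_inv = da.from_array(np.linalg.inv(XtX))  # invert in memory, wrap back as a dask array
coef_lazy = (XtX_inv @ (X.T @ y)).T          # lazy dask array, shape (n_targets, n_features)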