I am trying to perform multivariate linear regression on array data that is larger than memory. I am wondering how I should iterate a dask_ml linear regression function on a multidimensional dask array.

On small enough data, I can use sklearn.linear_model.LinearRegression or sklearn.linear_model.Ridge (with alpha=0.0), as these functions can take a multidimensional y, with shape (n_samples, n_targets). The problem can be looked at as performing linear regression n_targets times.
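To illustrate that equivalence, here is a minimal in-memory sketch. It uses plain NumPy least squares (np.linalg.lstsq) as a stand-in for the sklearn estimators and, as a simplifying assumption, fits no intercept. The sizes are arbitrary small values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 64, 5, 3  # small illustrative sizes
X = rng.random((n_samples, n_features))
y = rng.random((n_samples, n_targets))

# Fit all targets at once: lstsq accepts a 2-D right-hand side.
# The result has shape (n_features, n_targets).
coef_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fit each target column separately and stack the solutions as columns.
coef_loop = np.column_stack(
    [np.linalg.lstsq(X, y[:, i], rcond=None)[0] for i in range(n_targets)]
)

# Both approaches solve the same n_targets independent problems.
same = np.allclose(coef_joint, coef_loop)

# Transpose to match sklearn's coef_ layout of (n_targets, n_features).
coef_joint_T = coef_joint.T
```

Because the columns of y never interact, the targets could in principle be processed in independent blocks; what I am missing is an efficient way to do that with dask.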

Specifically, I am looking at using dask_ml.linear_model.LinearRegression (but I am open to suggestions for alternatives). However, this estimator only accepts a 1-dimensional y. I could loop over the target columns and fit each one separately, but this seems like a very slow and inefficient approach. What is a better way of doing this?

As a bonus question: I observe that the coef_ attribute of dask_ml.linear_model.LinearRegression is a numpy array, which implies that the fit is executed eagerly. Is there a reason it is not returned as a lazy dask array?

import dask.array as da

n_samples = 1024
n_features = 20
n_targets = 50 # this number is much larger in real life, around 1e6 to 1e8

# generate some random data
X = da.random.random((n_samples, n_features))
y = da.random.random((n_samples, n_targets))

# "regular" non-dask way of doing it, will result in MemoryError for large data
from sklearn.linear_model import LinearRegression

LR1 = LinearRegression()
LR1.fit(X, y)
LR1.coef_ # intended result, with shape (n_targets, n_features)

# very slow attempt at a dask version, but A) for loop is slow, B) coef output from function is numpy array
from dask_ml.linear_model import LinearRegression

LR2 = LinearRegression(C=999999) # large C makes the regularization strength 1/C close to zero
coef_ = []
for i in range(n_targets):
    c = LR2.fit(X, y[:,i]).coef_
    coef_.append(c)
coef_ = da.asarray(coef_)
TomNorway
