
I have two arrays, say varx and vary. Both contain NaN values at various positions. However, I would like to do a linear regression on both to show how much the two arrays correlate. This was very helpful so far.

However, using the following

slope, intercept, r_value, p_value, std_err = stats.linregress(varx, vary)

results in NaNs for every output variable. What is the most convenient way to take only valid values from both arrays as input to the linear regression? I heard about masking arrays, but am not sure how it works exactly.

cottontail
HyperCube

2 Answers


You can remove NaNs using a mask:

mask = ~np.isnan(varx) & ~np.isnan(vary)
slope, intercept, r_value, p_value, std_err = stats.linregress(varx[mask], vary[mask])
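For reference, here is a minimal self-contained sketch of the same masking approach. The sample data, the seed, and the NaN positions are made up for illustration; only the mask line and the linregress call come from the answer above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data: vary is roughly 2*varx + 1 plus a little noise
rng = np.random.default_rng(42)
varx = np.arange(20, dtype=float)
vary = 2.0 * varx + 1.0 + rng.normal(scale=0.1, size=varx.size)

# Put NaNs at different positions in each array
varx[3] = np.nan
vary[7] = np.nan

# True only where BOTH arrays hold a valid number
mask = ~np.isnan(varx) & ~np.isnan(vary)

# Boolean indexing drops the same positions from both arrays,
# so the remaining values stay paired
result = stats.linregress(varx[mask], vary[mask])

print(mask.sum())  # number of (x, y) pairs actually used
print(result.slope, result.intercept, result.rvalue)
```

Because one mask indexes both arrays, a NaN in either array removes the whole pair, which is what a regression needs.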
ecatmur
  • Works perfectly! Didn't know the ~ operator means "is not". – HyperCube Nov 30 '12 at 10:44
  • @HyperCube careful with that, it only means "is not" for NumPy arrays (it's an abuse of the normal meaning, which is the bitwise not operator). See http://stackoverflow.com/questions/13600988/python-tilde-unary-operator-as-negation-numpy-bool-array/13602395#13602395 – ecatmur Nov 30 '12 at 10:52
  • To be fair, it's not that much of an abuse if masks are thought of in the old-school sense of bitmasks – acjay Nov 30 '12 at 11:32
  • You can also keep it positive using `mask = np.isfinite(varx) & np.isfinite(vary)`. Of course this changes the meaning slightly to also exclude infinites. – SpinUp __ A Davis Aug 04 '15 at 15:49
  • @ecatmur, what happens if only vary contains some NaNs? When I tried to apply the method you suggest, I got the following error: ValueError: all the input array dimensions except for the concatenation axis must match exactly – user3841581 Nov 05 '15 at 15:59
  • @user3841581 that would indicate that `varx` and `vary` are different sizes, even before removing nans. – ecatmur Nov 05 '15 at 17:16
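
The points raised in the comments above can be checked with a short sketch. The array `a` is made up for illustration. It shows that `~` acts as an element-wise logical NOT on a NumPy boolean array, that it is the bitwise complement on a plain Python integer, and that `np.isfinite` also rejects infinities.

```python
import numpy as np

# Hypothetical array with one NaN and one infinity
a = np.array([1.0, np.nan, np.inf, 4.0])

# On a NumPy boolean array, ~ flips every element (logical NOT)
not_nan = ~np.isnan(a)     # keeps the infinity

# np.isfinite is False for NaN and for +/-inf
finite = np.isfinite(a)    # drops the infinity as well

# On a plain Python integer, ~ is the bitwise complement: ~n == -n - 1
flipped = ~1

print(not_nan)
print(finite)
print(flipped)
```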

This is not relevant for linregress, which only admits 1-D arrays anyway. But if x is 2-D and you're building a linear regression model with sklearn.linear_model.LinearRegression, statsmodels.api.OLS, etc., then the NaNs have to be dropped row-wise.

m = ~(np.isnan(x).any(axis=1) | np.isnan(y))
x_m, y_m = x[m], y[m]

In the above example, any() reduces the 2-D mask into a 1-D mask, which can be used to remove rows.
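
To make the reduction visible, here is a small sketch on a made-up 4x2 array. The values and the intermediate name `row_has_nan` are illustrative assumptions; the mask expression is the one shown above.

```python
import numpy as np

# Hypothetical data: row 1 of x and entry 2 of y contain a NaN
x = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0],
              [6.0, 7.0]])
y = np.array([1.0, 2.0, np.nan, 4.0])

# 2-D boolean mask -> 1-D: True for every row with at least one NaN
row_has_nan = np.isnan(x).any(axis=1)

# Keep a row only if neither x nor y has a NaN there
m = ~(row_has_nan | np.isnan(y))
x_m, y_m = x[m], y[m]

print(row_has_nan)
print(m)
print(x_m.shape, y_m.shape)
```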

A working example may look as follows.

import numpy as np
from sklearn.linear_model import LinearRegression
# sample data
x = np.random.default_rng(0).normal(size=(100,5))    # x is shape (100,5)
y = np.random.default_rng(0).normal(size=100)
# add some NaNs
x[[10,20], [1,3]] = np.nan
y[5] = np.nan


try:
    lr = LinearRegression().fit(x, y)         # <---- ValueError (input contains NaN)
except ValueError as e:
    print(e)                                  # show the error and keep going

m = ~(np.isnan(x).any(axis=1) | np.isnan(y))  
x_m, y_m = x[m], y[m]                         # remove NaNs
lr = LinearRegression().fit(x_m, y_m)         # <---- OK

With statsmodels, it's even easier because its models (e.g. OLS, Logit, GLM etc.) have a keyword argument missing= that can be used to drop NaNs under the hood.

import statsmodels.api as sm
model = sm.OLS(y, x, missing='drop').fit()
model.summary()
cottontail