ERROR: ValueError: Must pass DataFrame with boolean values only when I try to do a scatter plot using statsmodels

Question

I'm new to Python, and the related question I read didn't make much sense to me. I want to use Python to do a multiple regression and I am trying statsmodels. In this case I want to make a scatter plot.

Sample of my data:

ID   order  V1   V2    E1  E2  E3  M
103  1      ECA  TEXT  7   3   5   7
105  1      ECA  TEXT  3   7   4   5
107  1      ECA  TEXT  7   7   7   4
109  1      ECA  TEXT  6   6   6   3

I want to do a multiple regression with E1-E3 as my IVs and the mean score of M as my DV.

This is how I loaded my data.

myRegressionData = pd.read_csv('C:/Users/user/Desktop/Folder 1/Python/Regression data file.csv')

These are my x and y:

X_sk = myRegressionData[[col for col in myRegressionData.columns if col[:8] == 'E']]

Y = myRegressionData[['M{}'.format(ii) for ii in range(1, 19)]]
y = np.mean(Y, axis=1)

and this is the code where I get the error:

myRegressionData.plot(kind='scatter',x = X_sk, y=np.mean(Y, axis=1))

returns

ValueError: Must pass DataFrame with boolean values only

myRegressionData.info()

returns

RangeIndex: 90 entries, 0 to 89
Columns: 146 entries, IDOpenEndedResponse to EngagingAA
dtypes: float64(10), int64(134), object(2)
memory usage: 102.7+ KB

Without your data, we can't fully resolve this issue, for instance what does `myRegressionData.info()` show? Does it bork if you just did `myRegressionData.plot.scatter(x = X_sk, y = np.mean(Y, axis=1))` or just `myRegressionData.plot().bar()` — EdChum, Apr 11 '17 at 13:25
@EdChum that would be: RangeIndex: 90 entries, 0 to 89 Columns: 146 entries, IDOpenEndedResponse to EngagingAA dtypes: float64(10), int64(134), object(2) memory usage: 102.7+ KB . No it doesn't work, I get the same error. — Danai, Apr 11 '17 at 13:27
Please edit this into your question. Besides, the ValueError may be bogus, as a scatter plot can handle more than boolean values, so you need to post a minimal example that includes raw data, your code to load the df, all code that reproduces the error, and the desired output — EdChum, Apr 11 '17 at 13:30
Still need raw data unfortunately, you could have dodgy data — EdChum, Apr 11 '17 at 13:37
In your example data you only have one `M` target column, but your code aggregates across 18 target columns. Maybe you should indicate this by adding another `M` column to your sample data so that you have, say, `M1` and `M2`. You should also probably update your code to use `col[:1] == 'E'` and `range(1, 3)`, or similar. Also remember that `range` is inclusive at the start, but not at the end. — André C. Andersen, Apr 11 '17 at 14:13

André C. Andersen · Answer 1 · 2017-04-11T14:32:22.953

In the following:

myRegressionData.plot(kind='scatter',x = X_sk, y=np.mean(Y, axis=1))

x and y expect column names or indices. X_sk and np.mean(Y, axis=1) are data. Supply the column names, or use your plotter directly.

Example where we use matplotlib:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

myRegressionData = pd.DataFrame([
    {'a0': 4, 'a1': 3, 'b0': 2, 'b1': 1}, 
    {'a0': 3, 'a1': 1, 'b0': 4, 'b1': 1}, 
    {'a0': 1, 'a1': 2, 'b0': 3, 'b1': 1}
])

X_sk = myRegressionData[[col for col in myRegressionData.columns if col[:1] == 'b']]
Y = myRegressionData[['a{}'.format(ii) for ii in range(0,2)]]
plt.scatter(X_sk['b0'], np.mean(Y, axis=1))

The example should be a simplified version of what you're doing.

If you insist on using the pandas DataFrame plotter you can do something like this:

y = pd.DataFrame(np.mean(Y, axis=1), columns=['y'])
df = pd.concat([X_sk, y], axis=1)
df.plot(kind='scatter', x='b0', y='y')
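
Applied to a column layout like the one in the question, the same idea might look like the sketch below. The data are made up: the E1 to E3 values and M1 are copied from the sample rows, while M2 and the column name M_mean are hypothetical and only there for illustration. The Agg backend line is an assumption that makes the sketch run without a display; drop it for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up data in the shape of the question's sample (M2 is invented)
df = pd.DataFrame({
    'E1': [7, 3, 7, 6],
    'E2': [3, 7, 7, 6],
    'E3': [5, 4, 7, 6],
    'M1': [7, 5, 4, 3],
    'M2': [5, 5, 6, 1],
})

# Select predictor and target columns by their name prefix
e_cols = [col for col in df.columns if col.startswith('E')]
m_cols = [col for col in df.columns if col.startswith('M')]

# Store the row-wise mean as a real column so it can be referred to by name
df['M_mean'] = df[m_cols].mean(axis=1)

# One scatter plot per predictor; x and y are column names, not data
axes = [df.plot(kind='scatter', x=col, y='M_mean') for col in e_cols]
```

The point of the sketch is that the mean lives in the frame as a named column before plotting, so both x and y can be passed as strings.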

To plot several X columns against a single y, distinguishing the columns by color:

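X_sk = myRegressionData[[col for col in myRegressionData.columns if col[:1] == 'b']]
Y = myRegressionData[['a{}'.format(ii) for ii in range(0,2)]]
y = np.mean(Y, axis=1)
# Stack the X columns end to end and repeat y once per column so the sizes match
xx = pd.concat([X_sk[col] for col in X_sk.columns])
yy = pd.concat([y] * len(X_sk.columns))
# One color per X column (two columns here: b0 in blue, b1 in red)
colors = ['b'] * len(y) + ['r'] * len(y)
plt.scatter(xx, yy, c=colors)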

Final alternative using scatter_matrix:

from pandas.plotting import scatter_matrix

y = pd.DataFrame(np.mean(Y, axis=1), columns=['y'])
df = pd.concat([X_sk, y], axis=1)
scatter_matrix(df, alpha=0.2, figsize=(6, 6))

So that works but only with one column of X_SK at a time otherwise I get the error x and y must have the same size. — Danai, Apr 11 '17 at 14:12
`model = sm.OLS(y, X_sk)` `results = model.fit()` #Create a plot `fig, ax = plt.subplots()` `fig = sm.graphics.plot_fit(results, 0, ax=ax)` `plt.show()` Would that work? — Danai, Apr 11 '17 at 14:14
See my update. Just add `y` several times. Other than that I recommend having a look at this question about scatterplot matrices: http://stackoverflow.com/questions/7941207/is-there-a-function-to-make-scatterplot-matrices-in-matplotlib — André C. Andersen, Apr 11 '17 at 14:23
