Sklearn LogisticRegression solver needs 2 classes of data

Question

I'm trying to run a Logistic Regression via sklearn:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import datetime as dt
import pandas as pd
import numpy as np
import talib
import matplotlib.pyplot as plt
import seaborn as sns

col_names = ['dates','prices']
# load dataset
df = pd.read_csv("DJI2.csv", header=None, names=col_names)

df.drop('dates', axis=1, inplace=True)
print(df.shape)
df['3day MA'] = df['prices'].shift(1).rolling(window = 3).mean()
df['10day MA'] = df['prices'].shift(1).rolling(window = 10).mean()
df['30day MA'] = df['prices'].shift(1).rolling(window = 30).mean()
df['Std_dev']= df['prices'].rolling(5).std()
df['RSI'] = talib.RSI(df['prices'].values, timeperiod = 9)
df['Price_Rise'] = np.where(df['prices'].shift(-1) > df['prices'], 1, 0)
df = df.dropna()

xCols = ['3day MA', '10day MA', '30day MA', 'Std_dev', 'RSI', 'prices']
X = df[xCols]
X = X.astype('int')
Y = df['Price_Rise']
Y = Y.astype('int')

logreg = LogisticRegression()

for i in range(len(X)):
    # Without the check below I get: ValueError: Found array with 0 sample(s) (shape=(0, 6)) while a minimum of 1 is required.
    if i == 0:
        continue
    logreg.fit(X[:i], Y[:i])

However, when I try to run this code, I get the following error:

ValueError: 
This solver needs samples of at least 2 classes in the data, but the data contains only one class: 58

The shape of my X data is (27779, 6). The shape of my Y data is (27779,).

Here is a df.head(3) example to see what my data looks like:

     prices    3day MA  10day MA   30day MA   Std_dev        RSI  Price_Rise
30   58.11  57.973333    57.277  55.602333  0.247123  81.932338           1
31   58.42  58.043333    57.480  55.718667  0.213542  84.279674           1
32   58.51  58.216667    57.667  55.774000  0.249139  84.919586           0

I've searched for the cause of this error myself, but I've only managed to find two answers, both of which treat it as a bug in sklearn. They are both about two years old, so I don't think I'm running into the same issue.

1) You're passing your 'prices' variable as the dependent variable instead of 'Price_Rise', which appears to be your binary target. 2) It appears that you're fitting the regression with only one row of data. 3) I suspect you may have intended another grouping variable in your for loop, as RSI appears to be a continuous variable. – Brandon Bertelsen Feb 25 '19 at 19:42
4) You will be overwriting your fit on every loop iteration unless you save it to a new object. – Brandon Bertelsen Feb 25 '19 at 19:56
@BrandonBertelsen You're right about the binary data, thank you for pointing that out (I've made the edit in my question above as well), but I still get the same error. As for my loop, I'm trying to do this: 1) for every element 'i', 2) fit the logistic regression on Y '0' to Y 'i' and X '0' to X 'i', 3) predict Y 'i' (I haven't added this yet because the fit still raises the error). – user3357738 Feb 25 '19 at 20:14
I see you have updated the question, but `range(len(X['RSI']))` still means that your first iteration fits with one row of data. One row means only one class (0 or 1, not both), hence the error message. I recommend a traditional train/test split in place of this for loop. – Brandon Bertelsen Feb 25 '19 at 20:17
Add an assert(len(np.unique(Y[:i])) == 2) before you call fit and see whether it raises an error. – Noam Peled Feb 25 '19 at 20:19
Abundant examples here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – Brandon Bertelsen Feb 25 '19 at 20:21
I'm aware of train_test_split, but we've been told that, for whatever reason, we cannot split our data into training and testing sets for this. The flow in my comment above is what we are asked to do. I've edited the for loop to range(len(X)) and still get the same issue. @NoamPeled I get an AssertionError from this assert. – user3357738 Feb 25 '19 at 20:24
Well, that means you have only one class in Y[:i] for that iteration. You need to start with an i such that Y[:i] has 2 unique values. – Noam Peled Feb 25 '19 at 20:27

score 0 · Accepted Answer · answered Feb 25 '19 at 20:30

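You should make sure Y[:i] contains two unique values. So before your loop, add something like:

starting_i = 0
for i in range(len(X)):
    # Stop at the first prefix of Y that contains both classes
    if len(np.unique(Y[:i])) == 2:
        starting_i = i
        break

Then check that starting_i isn't 0 before running your main loop, and start the main loop from starting_i. Or, even simpler, find the first position where Y.iloc[i] != Y.iloc[0] (positional access, since the index of Y no longer starts at 0 after dropna).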
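The "first position where the label differs" idea can be sketched without an explicit loop. This is only an illustration, not part of the original answer: the helper name first_two_class_end is hypothetical, and the labels array is made-up data that mimics the first rows of Price_Rise shown in the question.

```python
import numpy as np

def first_two_class_end(y):
    """Return the smallest i such that y[:i] holds two distinct labels, or None.

    Hypothetical helper for illustration; not part of sklearn.
    """
    y = np.asarray(y)
    if y.size == 0:
        return None
    # Positions whose label differs from the very first label
    differs = np.flatnonzero(y != y[0])
    if differs.size == 0:
        return None  # only one class in the whole array
    # The slice y[:i] must include that position, hence the + 1
    return int(differs[0]) + 1

# Made-up labels that start like the Price_Rise column in the question
labels = np.array([1, 1, 0, 1, 0])
start = first_two_class_end(labels)
```

With a pandas Series, passing Y.to_numpy() (or Y.values) avoids any confusion between label-based and positional indexing. If the helper returns None, the data has a single class and no prefix can be fitted.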

answered Feb 25 '19 at 20:30

Noam Peled


`if i in range(0, 3): continue` seems to do the trick, thank you – user3357738 Feb 25 '19 at 20:34
I would search for starting_i, so your solution won't be data dependent. – Noam Peled Feb 25 '19 at 20:41

score 0 · Answer 2 · answered Feb 25 '19 at 20:36

if i in range(0, 3):
    continue

This fixed the issue. Y[:i] contained only one class before i = 3.
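A data-independent version of the same loop, following the flow described in the question's comments (fit on rows 0 to i-1, then predict row i), might look like the sketch below. It is an assumption-laden illustration: walk_forward_predictions is a hypothetical name, the data is synthetic rather than the DJI2.csv data, and max_iter=1000 is an arbitrary choice to help the default solver converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def walk_forward_predictions(X, y, max_iter=1000):
    """Fit on rows [0, i) and predict row i, skipping one-class prefixes.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    preds = {}
    for i in range(1, len(X)):
        # LogisticRegression.fit raises ValueError on a single-class target
        if np.unique(y[:i]).size < 2:
            continue
        model = LogisticRegression(max_iter=max_iter)
        model.fit(X[:i], y[:i])
        # X[i:i + 1] keeps the 2-D shape that predict expects
        preds[i] = int(model.predict(X[i:i + 1])[0])
    return preds

# Synthetic demo data (not the data from the question)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = np.array([1, 1, 0] + [int(v > 0) for v in X_demo[3:, 0]])
preds = walk_forward_predictions(X_demo, y_demo)
```

Checking the number of classes inside the loop replaces the hard-coded range(0, 3), so the same code works when the first class change happens at a different row. Creating a new model on each iteration also keeps each fit separate instead of overwriting a shared one.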

answered Feb 25 '19 at 20:36

user3357738
