Sklearn Error, array with 4 dim. Estimator <=2

Question

I have been trying to import data from Yahoo Finance via pandas and convert it to arrays with .as_matrix(). When I pass the data to the classifier for training, it raises an error:

ValueError: Found array with dim 4. Estimator expected <= 2.

My code is below:

from sklearn import tree
import pandas as pd
import pandas_datareader.data as web

df = web.DataReader('goog', 'yahoo', start='2012-5-1', end='2016-5-20')

close_price = df[['Close']]

ma_50 = (pd.rolling_mean(close_price, window=50))
ma_100 = (pd.rolling_mean(close_price, window=100))
ma_200 = (pd.rolling_mean(close_price, window=200))

#adding buys and sell based on the values
df['B/S']= (df['Close'].diff() < 0).astype(int)
close_buy = df[['Close']+['B/S']]
closing = df[['Close']].as_matrix()
buy_sell = df[['B/S']]


close_buy = pd.DataFrame.dropna(close_buy, 0, 'any')
ma_50 = pd.DataFrame.dropna(ma_50, 0, 'any')
ma_100 = pd.DataFrame.dropna(ma_100, 0, 'any')
ma_200 = pd.DataFrame.dropna(ma_200, 0, 'any')

close_buy = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_50 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_100 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_200 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
buy_sell = (df.loc['2013-02-15':'2016-05-21']).as_matrix

print(ma_100)
clf = tree.DecisionTreeClassifier()
x = [[close_buy,ma_50,ma_100,ma_200]]
y = [buy_sell]

clf.fit(x,y)

piRSquared · Accepted Answer · 2016-05-21T10:06:24.787

I found a couple of bugs/things needing fixing.

Missing parentheses: buy_sell = (df.loc['2013-02-15':'2016-05-21']).as_matrix never calls as_matrix, so buy_sell ends up bound to the method itself rather than to an array.
[[close_buy,ma_50,ma_100,ma_200]] is what gives you your 4 dimensions. Instead, I'd use np.concatenate, which takes a list of arrays and joins them either lengthwise or widthwise; the parameter axis=1 specifies widthwise. This makes x an 822 x 28 matrix: 822 observations of 28 features. If this isn't what you were going for, then clearly I didn't hit the mark, but those dimensions line up with your y.

Instead:

from sklearn import tree
import numpy as np  # needed for np.concatenate below
import pandas as pd
import pandas_datareader.data as web

df = web.DataReader('goog', 'yahoo', start='2012-5-1', end='2016-5-20')

close_price = df[['Close']]

ma_50 = (pd.rolling_mean(close_price, window=50))
ma_100 = (pd.rolling_mean(close_price, window=100))
ma_200 = (pd.rolling_mean(close_price, window=200))

#adding buys and sell based on the values
df['B/S']= (df['Close'].diff() < 0).astype(int)
close_buy = df[['Close']+['B/S']]
closing = df[['Close']].as_matrix()
buy_sell = df[['B/S']]


close_buy = pd.DataFrame.dropna(close_buy, 0, 'any')
ma_50 = pd.DataFrame.dropna(ma_50, 0, 'any')
ma_100 = pd.DataFrame.dropna(ma_100, 0, 'any')
ma_200 = pd.DataFrame.dropna(ma_200, 0, 'any')

close_buy = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_50 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_100 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
ma_200 = (df.loc['2013-02-15':'2016-05-21']).as_matrix()
buy_sell = (df.loc['2013-02-15':'2016-05-21']).as_matrix()  # Fixed

print(ma_100)
clf = tree.DecisionTreeClassifier()
x = np.concatenate([close_buy,ma_50,ma_100,ma_200], axis=1)  # Fixed
y = buy_sell  # Brackets not necessary... I don't think

clf.fit(x,y)

This ran for me:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
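
To see why the nested list produces four dimensions while np.concatenate produces two, here is a minimal sketch using small synthetic arrays in place of the real price data. The sizes (5 rows, 7 columns) and the fill values are assumptions chosen only for illustration; the real arrays in this answer have 822 rows.

```python
import numpy as np

# Synthetic stand-ins for the four arrays; 5 observations x 7 columns each.
# (Illustrative sizes only; the real data has 822 rows.)
n_obs, n_cols = 5, 7
close_buy = np.zeros((n_obs, n_cols))
ma_50 = np.ones((n_obs, n_cols))
ma_100 = np.full((n_obs, n_cols), 2.0)
ma_200 = np.full((n_obs, n_cols), 3.0)

# Wrapping the 2-D arrays in two levels of list adds two leading axes.
nested = np.array([[close_buy, ma_50, ma_100, ma_200]])
print(nested.ndim, nested.shape)

# Joining along axis=1 keeps one row per observation and puts the
# columns of each array side by side.
x = np.concatenate([close_buy, ma_50, ma_100, ma_200], axis=1)
print(x.ndim, x.shape)
```

The nested version has shape (1, 4, 5, 7), which is what the estimator rejects, while the concatenated version has shape (5, 28): one row per observation, with the first 7 columns coming from close_buy, the next 7 from ma_50, and so on.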

So when this runs, will the price, ma_50, ma_100 and ma_200 all be fed into the clf as one? — sam202252012, May 21 '16 at 10:14
The first 7 columns of x are identical to `close_buy`. The next 7 are identical to `ma_50` and so on. So... yes. — piRSquared, May 21 '16 at 10:18
The underlying arrays, `close_buy`, `ma_50` and such are already `np.arrays`. It seemed a natural fit. The answer is yes, it's possible but it would be cumbersome and are you sure you want that? — piRSquared, May 21 '16 at 10:19
It gives me another error, which is ValueError: Unknown label type: array([[ 7.87401353e+02, 7.93261381e+02, 7.87071324e+02, ..., 5.48000000e+06, 3.96049623e+02, 0.00000000e+00], [ 7.95991368e+02, 8.07001373e+02, 7.95281379e+02, ..., 5.88550000e+06, 4.03022676e+02, 0.00000000e+00], [ 8.05301357e+02, 8.08971379e+02, 7.91791350e+02, ..., 5.54900000e+06, 3.95834832e+02, 1.00000000e+00], — sam202252012, May 21 '16 at 10:25
I just re-verified the code I have posted. It still works for me. What version of sklearn are you using (`import sklearn; sklearn.__version__`)? I have 0.16.1. — piRSquared, May 21 '16 at 10:30
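
One plausible reading of the "Unknown label type" error above: the array it prints has many float columns, which suggests y is the whole sliced frame rather than only the 0/1 B/S column, and a classifier may reject continuous targets depending on the sklearn version. The sketch below is an assumption-laden illustration on synthetic data, not a verified fix for the original script. It assumes the label sits in the last column, as it would after df['B/S'] is added to the frame.

```python
import numpy as np

# Synthetic stand-in for the sliced frame converted to an array:
# 6 float feature columns followed by a final 0/1 label column.
# (Hypothetical data; the column layout is an assumption.)
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 6))
labels = np.array([0, 0, 1, 1, 0], dtype=float).reshape(-1, 1)
data = np.hstack([features, labels])

# Passing the whole array as y hands the classifier continuous values.
y_wrong = data
print(y_wrong.shape)

# Selecting only the label column gives a 1-D integer target,
# and dropping it from the features keeps the target out of X.
X = data[:, :-1]
y = data[:, -1].astype(int)
print(X.shape, y.shape, np.unique(y))
```

With X and y split this way, clf.fit(X, y) would receive a 2-D feature matrix and a 1-D vector of class labels. Whether this removes the error on a given sklearn version is untested here.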
