
I asked a question about the same problem earlier, but because my approach has changed I now have different questions.

My current code:

from openpyxl import load_workbook
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
#Set sizes
rowSize = 200
numColumns = 4

# read from the Excel file (assumes the data starts in row 1, with no header)
wb = load_workbook('python_excel_read.xlsx')
sheet_1 = wb["Sheet1"]

date = np.zeros(rowSize)
day = np.zeros(rowSize)
rain = np.zeros(rowSize)
temp = np.zeros(rowSize)
out = np.zeros(rowSize)

for i in range(0, rowSize):
    date[i] = sheet_1.cell(row=i + 1, column=1).value
    day[i] = sheet_1.cell(row=i + 1, column=2).value
    rain[i] = sheet_1.cell(row=i + 1, column=3).value
    temp[i] = sheet_1.cell(row=i + 1, column=4).value
    out[i] = sheet_1.cell(row=i + 1, column=5).value

train = np.zeros(shape=(rowSize,numColumns))
t_o = np.zeros(shape=(rowSize,1))

for i in range(0, rowSize):
    train[i] = [date[i], day[i], rain[i], temp[i]]
    t_o[i] = [out[i]]


X = train
# Output
y = t_o

X_train, X_test, y_train, y_test = train_test_split(X, y)

####Neural Net
nn = MLPRegressor(
    hidden_layer_sizes=(3,),  activation='relu', solver='adam', alpha=0.001, batch_size='auto',
    learning_rate='constant', learning_rate_init=0.01, power_t=0.5, max_iter=10000, shuffle=True,
    random_state=9, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
    early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
nn.fit(X_train, y_train.ravel())


y_pred = nn.predict(X_test)

###Linear Regression
# lm = LinearRegression()
# lm.fit(X_train,y_train)
# y_pred = lm.predict(X_test)

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(X_test[:,0], y_test, s=10, c='r', marker="o", label='actual')
ax1.scatter(X_test[:,0], y_pred, s=1, c='b', marker="s", label='NN prediction')
ax1.legend()
plt.show()

# Calc MSE; ravel() so the (n,1) y_test doesn't broadcast against the (n,) y_pred into an (n,n) matrix
mse = np.square(y_test.ravel() - y_pred).mean()

print(mse)

The results show a pretty bad prediction of the test data. Because I am new to this, I am not sure whether the problem is my data, the model, or my coding. Based on the plot, I believe the model is wrong for the data: it seems to predict something near-linear or quadratic, while the actual data is much more spread out.
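
One thing worth ruling out first: MLPs are sensitive to feature scale, and these columns range from 0/1 flags to day-of-year values in the hundreds. A minimal sketch of scaling inside a pipeline (same estimator settings as above; an illustration, not a guaranteed fix):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training split only, then the MLP on scaled inputs
pipe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(3,), max_iter=10000, random_state=9))
pipe.fit(X_train, y_train.ravel())
y_pred = pipe.predict(X_test)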

Here are some of the data points, formatted as: day of year (2 is Jan 2nd), weekday (1)/weekend (0), rain (1)/no rain (0), temp in °F, attendance (this is the output):

2   0   0   51  1366
4   0   0   62  538
5   1   0   71  317
6   1   0   76  174
7   1   0   78  176
8   1   0   68  220
12  1   1   64  256
13  1   1   60  379
14  1   0   64  316
18  0   0   72  758
19  1   0   72  1038
20  1   0   72  405
21  1   0   71  326
24  0   0   74  867
26  1   1   68  521
27  1   0   71  381
28  1   0   72  343
29  1   1   68  266
30  0   1   57  479
31  0   1   57  717
33  1   0   70  542
34  1   0   73  220
35  1   0   74  360
36  1   0   79  444
42  1   0   78  534
45  0   0   80  1572
52  0   0   76  1236
55  1   1   64  689
56  1   0   69  726
59  0   0   67  1188
60  0   0   74  1140
61  1   1   63  979
62  1   1   62  657
63  1   0   67  687
64  1   0   72  615
67  0   0   80  1074
68  1   0   81  1261
71  1   0   83  1332
73  0   0   85  1259
74  0   0   86  1142
76  1   0   88  1207
77  1   1   78  1438
82  1   0   85  1251
83  1   0   83  1019
85  1   0   86  1178
86  0   0   92  1306
87  0   0   92  1273
89  1   0   93  1101
90  1   0   92  1274
93  0   0   83  1548
94  0   0   86  1318
96  1   0   83  1395
97  1   0   81  1338
98  1   0   75  1240
100 0   0   84  1335
102 0   0   83  931
103 1   0   87  746
104 1   0   91  746
105 1   0   81  600
106 1   0   72  852
108 0   1   87  1204
109 0   0   89  1191
110 1   0   90  769
111 1   0   88  642
112 1   0   86  743
114 0   1   75  1085
115 0   1   78  1109
117 1   0   84  871
120 1   0   96  599
123 0   0   93  651
129 0   0   74  1325
133 1   0   88  637
134 1   0   84  470
135 0   1   73  980
136 0   0   72  1096
137 0   0   83  792
138 1   0   87  565
139 1   0   84  501
141 1   0   88  615
142 0   0   79  722
143 0   0   80  1363
144 0   0   82  1506
146 1   0   93  626
147 1   0   94  415
148 1   0   95  596
149 0   0   100 532
150 0   0   102 784
154 1   0   99  514
155 1   0   94  495
156 0   1   87  689
157 0   1   94  931
158 0   0   97  618
161 1   0   92  451
162 1   0   97  574
164 0   0   102 898
165 0   0   104 746
166 1   0   109 587
167 1   0   109 465
174 1   0   108 514
175 1   0   109 572
179 0   0   107 811
181 1   0   104 423
182 1   0   103 526
184 0   1   97  849
185 0   0   103 852
189 1   0   106 728
191 0   0   101 577
194 1   0   105 511
198 0   1   101 616
199 0   1   97  1056
200 0   0   94  740
202 1   0   103 498
205 0   0   101 610
206 0   0   106 944
207 0   0   105 769
208 1   0   103 551
209 1   0   103 624
210 1   0   97  513
212 0   1   107 561
213 0   0   100 905
214 0   0   105 767
215 1   0   107 510
216 1   0   108 406
217 1   0   109 439
218 1   0   103 427
219 0   1   104 460
224 1   0   105 213
227 0   0   112 834
228 0   0   109 615
229 1   0   105 216
230 1   0   104 213
231 1   0   104 256
232 1   0   104 282
235 0   0   104 569
238 1   0   103 165
239 1   1   105 176
241 0   1   108 727
242 0   1   105 652
243 1   1   103 231
244 1   0   96  117
245 1   1   98  168
246 1   1   97  113
247 0   0   95  227
248 0   0   92  1050
249 0   0   101 1274
250 1   1   95  1148
254 0   0   99  180
255 0   0   104 557
258 1   0   94  228
260 1   0   95  133
263 0   0   100 511
264 1   1   89  249
265 1   1   90  245
267 1   0   101 390
272 1   0   100 223
273 1   0   103 194
274 1   0   103 150
275 0   0   95  224
276 0   0   92  705
277 0   1   92  504
279 1   1   77  331
281 1   0   89  268
284 0   0   95  566
285 1   0   94  579
286 1   0   95  420
288 1   0   93  392
289 0   1   94  525
290 0   1   86  670
291 0   1   89  488
294 1   1   74  295
296 0   0   81  314
299 1   0   88  211
301 1   0   84  246
303 0   1   76  433
304 0   0   80  216
307 1   1   80  275
308 1   1   66  319
312 0   0   80  413
313 1   0   78  278
316 1   0   74  305
320 1   1   57  323
324 0   0   76  220
326 0   0   77  461
327 1   0   78  510
331 0   0   60  1701
334 1   0   58  237
335 1   0   62  355
336 1   0   68  266
338 0   0   70  246
342 1   0   72  109
343 1   0   70  103
347 0   0   58  486
349 1   0   52  144
350 1   0   53  209
351 1   0   55  289
354 0   0   62  707
355 1   0   59  903
359 0   0   58  481
360 0   0   53  1342
364 1   0   57  1624

I have over a thousand data points in total, but I'm not using them all for training/testing. One thought is that I need more data; another is that I need more factors, because temp/rain/day of week may not affect attendance enough.
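
A quick way to test the "more data" hypothesis is a learning curve: if validation error is still falling as the training size grows, more data should help; if it has flattened, more informative features are the better bet. A rough sketch, assuming X and y as built in the code above:

from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(3,), max_iter=10000))
sizes, train_scores, val_scores = learning_curve(
    model, X, y.ravel(), cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

# plot mean train vs. validation MSE against training-set size
plt.plot(sizes, -train_scores.mean(axis=1), label="train MSE")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation MSE")
plt.xlabel("training examples")
plt.legend()
plt.show()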

Here is the plot: [plot: Test Data]

What can I do to make my model more accurate and give better predictions?

Thanks

EDIT: I added more data points and another factor. I can't seem to upload the Excel file, so I put the data on here with a better explanation of how it is formatted.

EDIT: Here is the most recent code:

from openpyxl import load_workbook
import numpy as np
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
#Set sizes
rowSize = 500
numColumns = 254

# read from the Excel file (assumes the data starts in row 1, with no header)
wb = load_workbook('python_excel_read.xlsx')
sheet_1 = wb["Sheet1"]

# 'features' avoids shadowing the built-in input()
features = np.zeros(shape=(rowSize, numColumns))
out = np.zeros(rowSize)
for i in range(0, rowSize):
    for j in range(0, numColumns):
        features[i, j] = sheet_1.cell(row=i + 1, column=j + 1).value
    out[i] = sheet_1.cell(row=i + 1, column=numColumns + 1).value

X = features
# Output: reshape the targets into a column vector
y = out.reshape(-1, 1)

print(X)
print(y)
# Bin attendance into 4 classes: 0 (<500), 1 (500-1000), 2 (1000-1200), 3 (>1200)
y[y < 500] = 0
y[np.logical_and(y >= 500, y <= 1000)] = 1
y[np.logical_and(y > 1000, y <= 1200)] = 2
y[y > 1200] = 3

# Use cross-validation
#kf = KFold(n_splits = 10, random_state=0)
loo = LeaveOneOut()
# Try different models
clf = svm.SVC()
scaler = StandardScaler()
pipe = Pipeline([('scaler', scaler), ('svc', clf)])

accuracy = cross_val_score(pipe, X, y.ravel(), cv = loo, scoring = "accuracy")
print(accuracy.mean())

#y_pred = cross_val_predict(clf, X, y.ravel(), cv = kf)
#cm = confusion_matrix(y, y_pred)
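
A side note on cost: LeaveOneOut fits the SVC once per sample (500 fits here); the commented-out KFold is far cheaper, and for classification a stratified variant keeps the class balance in every fold. A sketch reusing the pipe, X, and y defined above:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(pipe, X, y.ravel(), cv=skf, scoring="accuracy")
print(accuracy.mean())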

And here is the up-to-date data with as many features as I could add. Note this is a random sample from the full data:

Link to sample data

Current output: 0.6230954290296712

My ultimate goal is to achieve 90% or higher accuracy... I don't believe I can find more features, but I will continue to gather as many as possible if that would help.
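
To see where the 0.62 comes from, it may help to finish the commented-out confusion-matrix idea in the code above; with stratified folds it stays cheap. A sketch, again reusing pipe, X, and y from that code:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(pipe, X, y.ravel(), cv=skf)
# rows are true classes, columns are predicted classes
print(confusion_matrix(y.ravel(), y_pred))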

  • This task probably fits well with linear regression; what makes you use a neural network? – Evgeny Jun 14 '18 at 00:22
  • It seemed like a good opportunity to learn them. Also, I assumed this wouldn't be linear, but it has been 2 years since my stats class, so maybe I am misunderstanding it – ChrisM Jun 14 '18 at 03:02
  • @ChrisM can you add the `python_excel_read.xlsx` data? Then I could provide an answer on how to increase the prediction performance. – seralouk Jun 14 '18 at 09:04
  • @seralouk I added more data to the question and uploaded my most recent code/results. I'm not sure how to upload the file since SO doesn't have a file hosting service. I suppose I could use one and give the link if needed – ChrisM Jun 14 '18 at 17:33
  • perfect. Your goal is to predict the output. Have you tried to use anything else except MLPRegressor? – seralouk Jun 14 '18 at 18:03
  • @seralouk I have tried Excel's forecast sheet, which did a decent job, but I can only figure out how to use that with a single predictor like date. I also (accidentally) tried MLPClassifier (I think that's what it was), which didn't work. Besides the MLPRegressor and linear regression I haven't tried anything else. Oh, I also tried creating my own neural net (see my other question) – ChrisM Jun 14 '18 at 18:19
  • Have you tried to normalize/scale the data before the model fitting? If not, try MinMaxScaler or StandardScaler and check if the MSE decreases – seralouk Jun 14 '18 at 18:39
  • @seralouk just tried using MinMax, via `scaler = MinMaxScaler(); X_train = scaler.fit_transform(X_train); X_test = scaler.transform(X_test)`, but the MSE didn't decrease much if at all (183770) and the fit seems about the same – ChrisM Jun 14 '18 at 18:51

1 Answer


Your question is quite general, but I have some suggestions. You could use cross-validation and try different models. Personally, I would try SVR and random forests, and as a last choice I would use an MLPRegressor; a comparison of a few classifiers under the same pipeline is sketched after the code below.

I modified your code a bit to show a simple example:

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# read the data
df = pd.read_excel('python_excel_read.xlsx', header = None)
rows, cols = df.shape

X = df.iloc[: , 0:(cols - 1)]
y = df.iloc[: , cols - 1 ]
print(X.shape)
print(y.shape)

# Bin attendance into 4 classes: 0 (<500), 1 (500-1000), 2 (1000-1200), 3 (>1200)
y[y < 500] = 0
y[np.logical_and(y >= 500, y <= 1000)] = 1
y[np.logical_and(y > 1000, y <= 1200)] = 2
y[y > 1200] = 3
print(np.unique(y))

# We can apply PCA to reduce the dimensions of the data
# pca = PCA(n_components=2)
# pca.fit(X)
# X = pca.fit_transform(X)

# Use cross-validation
#kf = KFold(n_splits = 10, random_state=0)
loo = LeaveOneOut()
# Try different models
clf = svm.SVC(kernel = 'linear')
scaler = StandardScaler()
pipe = Pipeline([('scaler', scaler), ('svc', clf)])

accuracy = cross_val_score(pipe, X, y.ravel(), cv = loo, scoring = "accuracy")
print(accuracy.mean())

#y_pred = cross_val_predict(clf, X, y.ravel(), cv = kf)
#cm = confusion_matrix(y, y_pred)
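
As mentioned above, trying a few models under the same scaling pipeline is cheap. A sketch reusing X and y from the code above (the candidate list and their settings are illustrative, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold

candidates = {
    'SVC (linear)': svm.SVC(kernel='linear'),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'MLP': MLPClassifier(max_iter=2000, random_state=0),
}
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in candidates.items():
    p = Pipeline([('scaler', StandardScaler()), ('clf', model)])
    scores = cross_val_score(p, X, y.ravel(), cv=skf, scoring='accuracy')
    print(name, scores.mean())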


  • I tried this out, and following your cheat sheet (very useful btw) I also tried ridge regression, ensemble regression, and lasso. All give MSE around -120k. Is MSE a bad measure for what I am trying to do? What other models do you think I should attempt? – ChrisM Jun 15 '18 at 16:46
  • I would also calculate RMSE and if this is bad too for these models, then I would move to more complex models (neural nets, non-linear models, PCA as preprocessing). I will try to play with your data and let you know – seralouk Jun 15 '18 at 16:47
  • Thank you. From my testing, lasso gives me the (marginally) best results, with an MSE of absolute value 125k. I will try some of the others you suggested now – ChrisM Jun 15 '18 at 22:13
  • I should probably mention my ultimate goal is to predict within 10% of actual attendance. Whether or not this is possible I don't know, but as close as I can get will be great. I actually added another feature to my data and got something with a pretty good shape using MLPRegressor from before, but the predictions are still off by too much – ChrisM Jun 15 '18 at 23:03
  • To achieve such high performance, I strongly believe that you need more features (and hopefully, informative features). – seralouk Jun 16 '18 at 08:09
  • Just to be clear, features are inputs that affect the output, right? So in my case a feature is something like temperature, weather, etc.? Also, would it be possible/easier/more accurate if I framed this as a classification problem where it just predicts whether attendance will be above or below certain amounts? – ChrisM Jun 18 '18 at 04:01
  • Hello. Yes, features are the parameters/variables like day, rain etc. If you turn the problem into classification it is possible to boost the performance. Can you split the attendance column based on a threshold? – seralouk Jun 18 '18 at 06:46
  • I should be able to, is it possible to split it into, say 3 categories? I will check it out and post my results – ChrisM Jun 18 '18 at 16:04
  • Yes of course. If it makes sense, you can split the column "attendance" into 2 or 3 groups let's say group 0,1 and 2 and then you can classify. I believe that the classification performance can be very high. Let me know if you want me to modify my answer and turn it into a classification problem – seralouk Jun 18 '18 at 16:08
  • Would you be able to add it to your answer? I split the column into 0-3: <500 being 0, 500-1000 being 1, 1000-1200 being 2, and 1200+ being 3. I'm using MLPClassifier now, but I still have to look up how to tune it properly, so any help would be appreciated – ChrisM Jun 18 '18 at 16:26
  • Yes I am going to provide an updated answer within the day. – seralouk Jun 19 '18 at 08:08
  • I believe you updated your code. In testing, I'm getting 50% accuracy. It seems that it guesses mostly 2. There was a slight typo in your code: `y[np.logical_and(y >= 500, y <= 1000)] = 2 y[np.logical_and(y >= 1000, y <= 1200)] = 1` is what it should be, but using this actually gives me 28% accuracy. I am working on adding features now; I will update the question when I do... any thoughts so far? Thank you for all the help so far – ChrisM Jun 20 '18 at 22:02
  • Hello. I updated my code again. This time I normalized the data (see the pipeline), but again the results are not good. I suggest adding more features. These features are not informative (enough) for higher classification accuracy. – seralouk Jun 21 '18 at 13:40
  • Do you think it would be better to add as many features as possible, or to combine some of the new ones into one feature if they are related and similar? Or does it depend? – ChrisM Jun 21 '18 at 16:06
  • I believe that adding new features could result in better results. Now, if you want to create combinations of the existing (or the new) features, you could use PCA. I can provide an example of this if needed. Also, I used leave-one-out cross-validation in my last edit. – seralouk Jun 21 '18 at 16:44
  • I am finishing up gathering/organizing the new features and will post them when I can to see what you think is the better option... I've got about 15 more features – ChrisM Jun 21 '18 at 17:32
  • I have edited the original post with the most recent code (yours, mostly) and the updated data... over 200 features – ChrisM Jun 22 '18 at 23:58
  • Hello. I updated my answer showing 1) how to load the data much more easily and 2) how to apply PCA if you want to reduce the dimensions before classification. The performance is still not very good, but consider that you get 0.60 accuracy while the chance level is 0.25, since you have 4 classes in `y`. To improve this, you need to play with different preprocessing functions and different classifiers. cheers – seralouk Jun 23 '18 at 10:48
  • Do you mind explaining what you mean by 4 levels in y? And do you have any recommendations for preprocessing? Thanks for all your help thus far – ChrisM Jun 25 '18 at 06:01
  • You are trying to classify some data into 4 classes (y has 0,1,2,3 values representing the classes). Thus the chance level of the classification is 100/4 = 25. If you had only two classes (e.g. y 0 and 1) then the chance level would be 50%. For the preprocessing, I suggest trying `MinMaxScaler` and `StandardScaler`. The features that you have are very different from each other and I strongly recommend to preprocess them before any model is applied. – seralouk Jun 25 '18 at 08:18
  • Should I scale only the data that is numeric and not the categorical data? Or just all data in one? – ChrisM Jun 25 '18 at 16:07
  • The numerical, in my opinion. Make sure that you fit the preprocessing function using the training data and then apply it on the testing data before the classification. This is done automatically in my code using `cross_val_score` and the `Pipeline`. See also my answer here: https://stackoverflow.com/a/50567308/5025009 – seralouk Jun 27 '18 at 08:30
  • Funny that you give that link, I actually referenced that exact post (and your answer to it) when I first tried using MinMaxScaler! Do you mind explaining what Pipeline does? Also, do you think it would be better to train on one year then test on another (days 1-365) to see the pattern? Or can the model pick it out using cross-validation anyway? Also, are ML models able to pick up on a pattern like this: an ad goes out on Tuesday, which brings in people on the weekend? The way I have the current data is that there is a 1 when the event happens, and I'm trying to take into account more data such as ads – ChrisM Jun 27 '18 at 16:13
  • Finally, what are some other good preprocessing steps? I am trying PCA in addition, and some of the new features I added have missing data (almost 1/3). Any recommendations on those? – ChrisM Jun 27 '18 at 16:14
  • 1) For missing data, you can replace the NaN with the mean or median of the variable, or finally exclude them; a minimal sketch of such a pipeline follows this thread. 2) Pipeline defines the order of the processing steps. You feed the pipeline to cross_val_score; inside, the data are split based on the `cv` that you pass, and the preprocessing is first fit on the training data and then applied to the testing data. More details (step by step) can be found in the link that I provided. – seralouk Jun 27 '18 at 19:44
  • 3) I tried PCA when I played with your data, and using only 1 component we can explain 90% of the variance. However, the predictive ability is limited again. I suggest playing with some preprocessing steps and different classifiers, but your data seem to be complex. – seralouk Jun 27 '18 at 19:44
  • Does 1 component explaining 90% of the variance suggest many of the features are correlated? If so, would there be a benefit to combining these features BEFORE any preprocessing (aka manually)? Would this, combined with adding more (hopefully uncorrelated) features, help reduce the complexity, or do PCA/other methods handle it anyway, making manual feature removal near useless? Thanks again for all the help so far; what started as a somewhat simple task has turned into a complex challenge. I've already learned so much, and the problem is not solved yet – ChrisM Jun 27 '18 at 21:09
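
Following up on the missing-data and PCA points above, here is a minimal sketch of such a pipeline: impute NaNs with the column median, scale, optionally reduce dimensions, then classify. SimpleImputer assumes scikit-learn 0.20+; the binning reproduces the 4 classes used earlier, and all steps are fit on the training folds only inside cross_val_score:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_excel('python_excel_read.xlsx', header=None)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# same 4 classes as before: 0 (<500), 1 (500-1000), 2 (1000-1200), 3 (>1200)
y = (y >= 500).astype(int) + (y > 1000).astype(int) + (y > 1200).astype(int)

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs with column medians
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),               # keep 95% of the variance
    ('svc', SVC(kernel='linear')),
])
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, cv=skf, scoring='accuracy').mean())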