
I'm trying to train a model to predict departure delay based on airline, day of the month, Dest and Origin. I tried several approaches but the accuracy is very low. First I used the delay labels directly, varying from -20 to +20 min; then I tried making it easier by binning them into intervals, so that delays in [0, 5) => 0, [5, 10) => 1, etc.

But still the accuracy is bad, so I tried several approaches:

Changing the layers
Not normalizing the features
Removing and adding new features

But still I can't find something that works

################### Load the dataset
df = dataset[['UniqueCarrier','DayofMonth','DepDelay','Dest','Origin']]
df.tail()
df = df.dropna()
df = df[(df['DepDelay'] <= 20) & (df['DepDelay'] >= -20)]
############### mask delay values
mask = (df.DepDelay > 0) & (df.DepDelay < 5)
df.loc[mask, 'DepDelay'] = 0

mask = (df.DepDelay >= 5) & (df.DepDelay < 10)
df.loc[mask, 'DepDelay'] = 1

mask = (df.DepDelay >= 10) & (df.DepDelay < 15)
df.loc[mask, 'DepDelay'] = 2

mask = (df.DepDelay >= 15) & (df.DepDelay <= 20)
df.loc[mask, 'DepDelay'] = 3

mask = (df.DepDelay >= -5) & (df.DepDelay < 0)
df.loc[mask, 'DepDelay'] = -1

mask = (df.DepDelay >= -10) & (df.DepDelay < -5)
df.loc[mask, 'DepDelay'] = -2

mask = (df.DepDelay >= -15) & (df.DepDelay < -10)
df.loc[mask, 'DepDelay'] = -3

mask = (df.DepDelay >= -20) & (df.DepDelay < -15)
df.loc[mask, 'DepDelay'] = -4
############### Splitting labels and features
y = df['DepDelay']

df.drop(columns=['DepDelay'], inplace=True)
################ replacing character values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Dest'] = le.fit_transform(df.Dest.values)
df['Origin'] = le.fit_transform(df.Origin.values)
df['UniqueCarrier'] = le.fit_transform(df.UniqueCarrier.values)
########################## normalization
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Normalize the training data
std_scale = StandardScaler().fit(df)

df_norm = std_scale.transform(df)
training_norm_col1 = pd.DataFrame(df_norm, index=df.index,
    columns=df.columns)
df.update(training_norm_col1)
print(df.head())
########################## THE model
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))

model = Sequential()
model.add(Dense(64, input_dim=4, activation='relu'))
model.add(Dense(30, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(1))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
# Fit the model
history = LossHistory()
model.fit(df, y, validation_split=0.33, epochs=1000,
    batch_size=50, verbose=1, callbacks=[history])
print(history.losses)

The accuracy is about 0.3524 while training. The dataframe used for training has about 3M rows.

YSN BRYAN

3 Answers


When using loss = 'mean_squared_error' and no activation (i.e. the default linear one) in a single-node final layer, as you do here, you are in a regression setting, where accuracy is meaningless (it is meaningful only in classification problems).

Unfortunately, Keras will not "protect" you in such a case, insisting on computing and reporting back an "accuracy", despite the fact that it is meaningless and inappropriate for your problem - see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?

If you want to stick to a regression setting, you should simply remove metrics=['accuracy'] from your model compilation, and don't bother - in regression settings, MSE itself can (and usually does) serve also as the performance metric. But this means that you will try to directly predict numeric values, not "labels" coming from the binning, as you describe.
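
In that case, the compilation step would simply become something like:

model.compile(loss='mean_squared_error', optimizer='adam')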

If you want to predict binned intervals like

[0, 5) => 0
[5, 10) => 1

etc., i.e. work in a classification setting, you should change your loss to categorical_crossentropy and keep accuracy as your metric. Keep in mind that you should also convert your labels to one-hot encoded ones (see Keras to_categorical), and replace your final layer with

model.add(Dense(num_classes, activation='softmax'))

where num_classes is the number of classes resulting from your binning procedure.
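
Putting the above together, a minimal sketch of the classification variant could look like this (the y + 4 shift and num_classes = 8 are assumptions based on the bins -4 to 3 from your question):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

num_classes = 8                                # assumed: bins -4 .. 3 from the question
y_onehot = to_categorical(y + 4, num_classes)  # shift labels to 0 .. 7 and one-hot encode

model = Sequential()
model.add(Dense(64, input_dim=4, activation='relu'))
model.add(Dense(30, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])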

desertnaut

Looking at your data set, you have a mixture of a classification and a regression problem. You may very well be able to model it with Keras, but if your case is really regression, then the classification becomes meaningless. My suggestion instead is to try Decision Trees.

Arad Haselirad
  • In such cases, one can be either in a regression context or in a classification one - there are no "mixtures"... – desertnaut Feb 15 '19 at 15:50
  • Actually I decided to switch the problem to a classification problem, using the intervals as multiclass labels. Hope it will work better – YSN BRYAN Feb 15 '19 at 17:17

My experience (e.g. with age estimation) is that it is always better to train the network with a combined loss, i.e. regression + classification.

I think you have already figured out how to do the problem in the classification way, which is to quantize your target outputs into predefined bins. As a result, your classification output will predict the probability of a sample belonging to a bin.

Without loss of generality, say you have N bins, and the center value of the kth bin is c[k]. Now the question is how to do inference, i.e. given a testing sample, how to estimate the exact flight delay. In other words, you need to convert the classification problem back to a regression problem at testing time anyway, unless you are satisfied with a bin estimate.

One simple way to estimate the flight delay (fd) is to take the weighted average of your bin classification results, i.e.

fd = np.sum(proba * centers)

where proba is the bin probability from clf.predict(sample), and centers are the center values of your bins, i.e. centers=[c[k] for k in range(N)].
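
As a concrete sketch, assuming model is your trained softmax classifier and the 8 bins from the question (the center values below are illustrative, not prescribed):

import numpy as np

centers = np.array([-17.5, -12.5, -7.5, -2.5, 2.5, 7.5, 12.5, 17.5])  # illustrative c[k]
proba = model.predict(sample)[0]  # bin probabilities for one testing sample
fd = np.sum(proba * centers)      # expected flight delay in minutes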

Of course, there are other ways you could use at inference time, but let's just use this one as the example. Now the question is: how do you integrate this inference function into the loss function?

I hope you have already got the answer, which is simply to compute the loss between the fd inferred using the above formula and the ground-truth delay.

Assuming you have a Keras model that performs the classification task, below is an example of training it with the regression loss:

import numpy as np
import keras
from keras import backend as K

centers = K.variable(value=np.array([...]), dtype='float32') # shape of 1xN, fill in your center values

def regLoss(y_true, y_pred):
    # Note:
    # a. your y_true will be the actual delay time, not bin membership
    # b. y_pred is still the same as that for the classification task, i.e. the bin membership

    # 1. convert your y_pred to a flight delay estimate
    y_pred = K.sum(centers * y_pred, axis=-1)
    # 2. compute the loss between flight delay numbers
    return keras.losses.mae(y_true, y_pred)

Now you can train the same model with the new regression loss.

As I mentioned earlier, it is better to train with both the regression and the classification loss, because using them together will help you optimize the network in a better way. Why?

Because when using only the classification loss, given

gt=[1,0,0,0,0,0]
p1=[0,1,0,0,0,0]
p2=[0,0,0,0,0,1]

you will have L(gt,p1) = L(gt,p2). However, when you think about your problem, what we really want is L(gt,p1) < L(gt,p2), and this part will be covered after introducing the regression loss.
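
To make this concrete, a quick numeric check with illustrative bin centers shows how the regression term separates the two predictions:

import numpy as np

centers = np.array([2.5, 7.5, 12.5, 17.5, 22.5, 27.5])  # illustrative centers
gt = np.array([1, 0, 0, 0, 0, 0])
p1 = np.array([0, 1, 0, 0, 0, 0])
p2 = np.array([0, 0, 0, 0, 0, 1])

print(abs(centers @ gt - centers @ p1))  # 5.0  -> small regression penalty
print(abs(centers @ gt - centers @ p2))  # 25.0 -> large regression penalty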

At the same time, the problem with using only the regression loss is that you don't really know the physical meaning of the features used to predict the target value; you only know that if one of them hits an outlier value, your prediction gets messed up. With the classification loss, you know that the direct feature used for regression is the bin membership.
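
For completeness, one possible way (a sketch, not the only option) to combine the two terms into a single loss, where alpha is a hypothetical weighting and the true delay is approximated by the center of the ground-truth bin:

import keras
from keras import backend as K

alpha = 0.5  # hypothetical weight for the regression term

def combinedLoss(y_true, y_pred):
    # y_true: one-hot bin membership, y_pred: softmax bin probabilities
    cls = keras.losses.categorical_crossentropy(y_true, y_pred)
    # expected delay under the prediction vs. the center of the true bin,
    # using the same `centers` variable defined above
    fd_pred = K.sum(centers * y_pred, axis=-1)
    fd_true = K.sum(centers * y_true, axis=-1)
    return cls + alpha * K.abs(fd_true - fd_pred)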

pitfall