LSTM for 30 classes, badly overfitting, cannot go over 76% test accuracy

Question

How to classify job descriptions into their respective industries?

I'm trying to classify text using an LSTM, specifically mapping job descriptions to industry categories. Unfortunately, everything I've tried so far has topped out at 76% accuracy.

What is an effective method to classify text into more than 30 classes using an LSTM?

I have tried three alternatives:

Model_1

Model_1 achieves test accuracy of 65%

embedding_dimension = 80
max_sequence_length = 3000
epochs = 50
batch_size = 100

    model = Sequential()
    model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))

    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Model_2

Model_2 achieves test accuracy of 64%

    model = Sequential()
    model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))

    model.add(LSTM(100))

    model.add(Dropout(rate=0.5))
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))

    model.add(Dropout(rate=0.5))
    model.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))

    model.add(Dropout(rate=0.5))
    model.add(Dense(output_dim, activation='softmax'))


    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

Model_3

Model_3 achieves test accuracy of 76%

    model = Sequential()
    model.add(Embedding(max_words, embedding_dimension, input_length=x_shape, trainable=False))

    model.add(SpatialDropout1D(0.4))
    model.add(LSTM(100, dropout=0.4, recurrent_dropout=0.4))

    model.add(Dense(128, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.039, seed=None)))
    model.add(BatchNormalization())

    model.add(Dense(64, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.55, seed=None)) )
    model.add(BatchNormalization())

    model.add(Dense(32, activation='sigmoid', kernel_initializer=RandomNormal(mean=0.0, stddev=0.55, seed=None)) )
    model.add(BatchNormalization())

    model.add(Dense(output_dim, activation='softmax'))

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

I'd like to know how to improve the accuracy of the network.

Please keep in mind, that "lakhs" is a number prefix not known to anyone outside of India. — JE_Muc, Nov 25 '20 at 09:33

score 2 · Answer 1 · answered Nov 26 '20 at 06:20

Start with a minimal baseline

You have a simple network at the top of your code, but try this one as your baseline:

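from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Embedding -> one small LSTM -> softmax, nothing else
model = Sequential()
model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
model.add(LSTM(output_dim//4))
model.add(Dense(output_dim, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])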

The intuition here is to see how much work the LSTM alone can do. We don't need it to output the full 30 output_dims (the number of classes), just a smaller set of features to base the class decision on.

Your larger networks have layers like Dense(128) with 100 inputs. That's 100 x 128 = 12,800 weights to learn (plus 128 biases).
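To put that arithmetic next to the baseline, here is a rough parameter count using the standard formulas for fully connected and LSTM layers. It is only a sketch: embedding_dimension = 80 comes from the question, while num_classes = 30 is an assumption (the question only says "more than 30"), and dense_params and lstm_params are helper names made up for this illustration.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit in a fully connected layer."""
    return inputs * units + units

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

embedding_dimension = 80   # from the question
num_classes = 30           # assumed; the question only says "more than 30"

# The question's larger head: LSTM(100) feeding Dense(128)
print("LSTM(100):", lstm_params(embedding_dimension, 100))
print("Dense(128) on 100 inputs:", dense_params(100, 128))

# The baseline: LSTM(num_classes // 4) feeding the softmax layer directly
baseline_units = num_classes // 4
print(f"LSTM({baseline_units}):", lstm_params(embedding_dimension, baseline_units))
print("Softmax layer:", dense_params(baseline_units, num_classes))
```

Under these assumptions the baseline has far fewer trainable parameters after the embedding, which leaves it much less room to memorise the training set.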

Addressing imbalance right away

Your data may be heavily imbalanced, so as the next step let's address that with a loss function I'll call top_k_loss. It makes your network train only on the examples in each batch that it is having the most trouble with. In my experience this does a great job of handling class imbalance without any other plumbing.

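import tensorflow as tf

def top_k_loss(k=16):
    @tf.function
    def loss(y_true, y_pred):
        # Per-example cross-entropy, shape (batch_size,)
        y_error_of_true = tf.keras.losses.categorical_crossentropy(y_true=y_true, y_pred=y_pred)
        # Use the dynamic batch size: the static shape can be None, and the last batch may hold fewer than k examples
        k_eff = tf.minimum(k, tf.shape(y_error_of_true)[0])
        # Keep only the k largest losses, i.e. the hardest examples in the batch
        topk, _ = tf.math.top_k(y_error_of_true, k=k_eff)
        return topk
    return loss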
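If the selection step is hard to picture, here is a framework-free sketch of the same idea in plain Python. The batch values are invented for illustration, and per_example_cross_entropy and top_k_losses are hypothetical helper names, not part of TensorFlow.

```python
import math

def per_example_cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy for each example in a batch of one-hot labels."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        losses.append(-sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row)))
    return losses

def top_k_losses(losses, k):
    """Keep the k largest losses (or all of them if the batch is smaller than k)."""
    return sorted(losses, reverse=True)[:min(k, len(losses))]

# Hypothetical batch: 4 examples, 3 classes (values chosen for illustration only).
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]

all_losses = per_example_cross_entropy(y_true, y_pred)
hardest = top_k_losses(all_losses, k=2)

mean_all = sum(all_losses) / len(all_losses)
mean_hardest = sum(hardest) / len(hardest)
print(f"mean over batch: {mean_all:.3f}, mean over 2 hardest: {mean_hardest:.3f}")
```

The mean over the hardest examples can never be lower than the mean over the whole batch, so a loss built this way always looks more pessimistic than plain cross-entropy on the same predictions.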

Use this with a batch size of 128 to 512. Add it to your model compile like so:

model.compile(loss=top_k_loss(16), optimizer='adam', metrics=['accuracy'])

Now, you'll see that model.fit reports a disappointing loss with this. That's because the loss covers only the worst 16 examples in each training batch. Recompile with your regular loss and run model.evaluate to find out how the model does on the training set, and again on the test set.

Train for 100 epochs, and at this point you should already see some good results.

Next Steps

Wrap model creation, training and testing in a function like so:

def run_experiment(lstm_layers=1, lstm_size=output_dim//4, dense_layers=0, dense_size=output_dim//4):
    model = Sequential()
    model.add(Embedding(max_words, embedding_dimension, input_length=x_shape))
    # Every LSTM except the last must return the full sequence for the next LSTM
    for i in range(lstm_layers-1):
        model.add(LSTM(lstm_size, return_sequences=True))
    model.add(LSTM(lstm_size))
    for i in range(dense_layers):
        model.add(Dense(dense_size, activation='tanh'))
    model.add(Dense(output_dim, activation='softmax'))
    # Train on the hardest examples of each batch (batch size in the 128 to 512 range)
    model.compile(loss=top_k_loss(16), optimizer='adam', metrics=['accuracy'])
    model.fit(x=x, y=y, epochs=100, batch_size=256)
    # Recompile with the regular loss so the evaluation numbers are comparable
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    loss, accuracy = model.evaluate(x=x_test, y=y_test)
    return loss

That function runs a whole experiment for you. Now it is a matter of searching for a better architecture. One way to search is randomly, and random search is actually really good. If you want to get fancy, I recommend hyperopt. Don't bother with grid search; random search usually beats it in large search spaces.

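from random import randint

best_loss = 10**10
best_config = None

for trial in range(100):
    # randint is inclusive at both ends
    config = [
        randint(1,4), # lstm_layers
        randint(8,64), # lstm_size
        randint(0,8), # dense_layers
        randint(8,64) # dense_size
    ]
    result = run_experiment(*config)
    if result < best_loss:
        # Remember the new best loss, otherwise every trial counts as an improvement
        best_loss = result
        best_config = config
        print('Found a better loss ',result,' from config ',config)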
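To get a feel for why I'd skip grid search here, the sketch below counts the configurations covered by the randint ranges in the loop above and compares that with a 100-trial budget. The dictionary and the choices helper are names invented for this illustration.

```python
# Inclusive ranges copied from the randint calls in the search loop.
search_space = {
    "lstm_layers": (1, 4),
    "lstm_size": (8, 64),
    "dense_layers": (0, 8),
    "dense_size": (8, 64),
}

def choices(low, high):
    """Number of integers in the inclusive range [low, high]."""
    return high - low + 1

total_configs = 1
for low, high in search_space.values():
    total_configs *= choices(low, high)

trials = 100
# A full grid with the same budget gets roughly trials ** (1/4) values per axis.
grid_values_per_axis = int(trials ** (1 / len(search_space)))

print(f"{total_configs} possible configurations")
print(f"{trials} trials cover {trials / total_configs:.4%} of them")
print(f"an equal-budget grid tries only {grid_values_per_axis} values per hyperparameter")
```

With the same budget, a grid is stuck with a handful of values per hyperparameter, while each random trial can draw a fresh value for every one of them.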