
I'm a newbie in ML and am trying to classify text into two categories. My dataset was built with Keras' Tokenizer from medical texts; it's unbalanced, with 572 records for training and 471 for testing.

It's really hard for me to get a model with diverse prediction output; almost all predicted values are the same. I've tried using models from examples like this one and tweaking the parameters myself, but the output never makes sense.

Here are the tokenized and prepared data.

Here is the script: Gist

Sample model that I used:

    from tensorflow import keras
    from tensorflow.keras import layers

    # Simple feed-forward classifier over the tokenized text features
    sequential_model = keras.Sequential([
        layers.Dense(15, activation='tanh', input_dim=vocab_size),
        layers.BatchNormalization(),
        layers.Dense(8, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(1, activation='sigmoid')  # sigmoid output for binary classification
    ])

    sequential_model.summary()
    sequential_model.compile(optimizer='adam',
                             loss='binary_crossentropy',
                             metrics=['acc'])

    train_history = sequential_model.fit(train_data,
                                         train_labels,
                                         epochs=15,
                                         batch_size=16,
                                         validation_data=(test_data, test_labels),
                                         class_weight={1: 1, 0: 0.2},  # manual weighting for the imbalance
                                         verbose=1)

Unfortunately, I can't share the datasets. I've also tried using keras.utils.to_categorical with the class labels, but it didn't help.
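For reference, here is a minimal sketch of how the class_weight dictionary could be derived from the label frequencies instead of being hand-picked as {1: 1, 0: 0.2}. It assumes train_labels is a flat array of 0/1 integer labels, as in the script above:

    import numpy as np

    # Sketch only: balanced class weights computed from label frequencies.
    # Assumes train_labels is a 1-D array of 0/1 integer labels.
    counts = np.bincount(train_labels)                      # [n_class0, n_class1]
    class_weight = {i: len(train_labels) / (2.0 * c) for i, c in enumerate(counts)}
    # With many more 0s than 1s, class 1 ends up with the larger weight.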

  • Hi Yaroslav, what exactly is your problem? I ran your code and you are already getting decent validation/training accuracy of 82-84%. Your loss curves make sense: we see the network overfit to the training set, while the validation curve has the usual bowl shape. To make your network perform better, you can always deepen it (more layers), widen it (more units per hidden layer) and/or add more nonlinear activation functions so your layers can map to a wider range of values. As an example, I changed the activation of the 2nd layer to sigmoid and saw the accuracy increase by 2-3%. – Sachin Raghavendran May 09 '19 at 18:12
  • @SachinRaghavendran Accuracy is high, but if you take a look at the predict output you can see that all records are classified into one class. I think that's because there are many more samples with label '0' in the dataset. I've tried adjusting the number of layers and the parameters, but the prediction result is always around one number with very small deviation. This is the output with your changes: [[0.421591] [0.395864] [0.395864] [0.421591] [0.395864] [0.395864] ... ] And these are the train labels: [0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 ... ] Unfortunately, it makes no sense, even though this is the training data – Yaroslav Shulyak May 10 '19 at 05:50
  • You said the output is "[[0.421591] [0.395864] [0.395864] [0.421591] [0.395864] [0.395864] ... ]"; this is to be expected with a sigmoid output layer. The sigmoid activation function squashes all output values of the forward pass to be between 0 and 1 (which is why the values look like that). When the labels are generated, outputs below 0.5 are classified as 0 and outputs above 0.5 as 1 (from my reading of the source code, this is handled in model.evaluate). – Sachin Raghavendran May 10 '19 at 21:02
  • Also, I believe the reason why you originally got so many repeated values is the size of your network. Apparently, each of the data points has roughly 20,000 features (a pretty large feature space); your network is too small, so the space of output values it can map to is correspondingly small. I did some testing with larger hidden layers (and bumped up the number of layers) and was able to see the prediction values vary: [0.519], [0.41], [0.37]... (these would be mapped to binary values as per the above comment) – Sachin Raghavendran May 10 '19 at 21:12
  • It is also understandable that your network's performance varies this much, because the number of features you have is about 50 times the size of your training set (usually you would like a smaller proportion). Keep in mind that training for too many epochs (say, more than 10) on such small training and test datasets just to see improvements in loss is not good practice, as you can seriously overfit; it is probably a sign that your network needs to be wider/deeper. – Sachin Raghavendran May 10 '19 at 21:13
  • Thank you for the tips. I knew about the sigmoid layer, but the others were helpful. You can combine them into an answer and I will accept it – Yaroslav Shulyak May 11 '19 at 07:58
  • Much appreciated, Yaroslav. Glad that it helped you. – Sachin Raghavendran May 11 '19 at 21:06

1 Answer


Your loss curves make sense: we see the network overfit to the training set, while the validation curve has the usual bowl shape.

To make your network perform better, you can always deepen it (more layers), widen it (more units per hidden layer) and/or add more nonlinear activation functions so your layers can map to a wider range of values; a sketch of such a variant follows below.
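As a rough sketch of a wider and deeper variant (the layer sizes here are illustrative placeholders, not tuned values, and vocab_size is assumed to be defined as in the question):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Illustrative only: more layers, more units per layer, mixed nonlinearities.
    wider_model = keras.Sequential([
        layers.Dense(128, activation='relu', input_dim=vocab_size),
        layers.BatchNormalization(),
        layers.Dense(64, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(32, activation='sigmoid'),   # extra nonlinearity, as suggested above
        layers.Dense(1, activation='sigmoid')
    ])
    wider_model.compile(optimizer='adam',
                        loss='binary_crossentropy',
                        metrics=['acc'])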

Also, I believe the reason why you originally got so many repeated values is the size of your network. Apparently, each of the data points has roughly 20,000 features (a pretty large feature space); your network is too small, so the space of output values it can map to is correspondingly small. I did some testing with larger hidden layers (and bumped up the number of layers) and was able to see the prediction values vary: [0.519], [0.41], [0.37]...
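For reference, these raw sigmoid outputs become class labels by thresholding at 0.5, as mentioned in the comments; a minimal sketch, assuming the model and test_data from the question:

    import numpy as np

    # probs has shape (n_samples, 1) with values in (0, 1) from the sigmoid output.
    probs = sequential_model.predict(test_data)
    predicted_labels = (probs > 0.5).astype(int).ravel()   # below 0.5 -> 0, above -> 1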

It is also understandable that your network's performance varies this much, because the number of features you have is about 50 times the size of your training set (usually you would like a smaller proportion). Keep in mind that training for too many epochs (say, more than 10) on such small training and test datasets just to see improvements in loss is not good practice, as you can seriously overfit, and it is probably a sign that your network needs to be wider/deeper.

All of these factors, such as the number of layers, the hidden layer sizes and even the number of epochs, can be treated as hyperparameters. In other words, hold out some percentage of your training data as a validation split, go one by one through each category of factors and optimize to get the highest validation accuracy. To be fair, your training set is not that large, but I believe you should hold out some 10-20% of the training data as a validation set to tune these hyperparameters, given that you have such a large number of features per data point. At the end of this process, you should be able to determine your true test accuracy. This is how I would optimize to get the best performance out of this network. Hope this helps.
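As a sketch of that tuning loop (the 20% split and the candidate widths below are placeholders, not recommendations, and the imports and data are assumed to be the same as in the question):

    # Try a few hidden-layer widths, each evaluated on a 20% validation split.
    best_width, best_val_acc = None, 0.0
    for width in [16, 64, 128]:                      # hypothetical candidate values
        model = keras.Sequential([
            layers.Dense(width, activation='relu', input_dim=vocab_size),
            layers.Dense(1, activation='sigmoid')
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
        history = model.fit(train_data, train_labels,
                            epochs=10, batch_size=16,
                            validation_split=0.2,    # last 20% of training data held out
                            verbose=0)
        val_acc = max(history.history['val_acc'])
        if val_acc > best_val_acc:
            best_width, best_val_acc = width, val_acc
    # Only after choosing the hyperparameters evaluate once on the real test set.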

More about training, test, and validation splits