
I am attempting one-vs-all multi-label classification. I feed a batch of inputs to each classifier along with its expected labels. Each classifier uses a softmax output layer to predict a label as yes or no, and each minimizes its own softmax cross-entropy loss. The classifiers keep reducing their loss at every step, yet they end up predicting every label as zero.

I suspect this is because the number of positive examples for each label is very small compared to the size of the entire dataset.

Is this because I'm doing something wrong in the way I train my models, or is it because of the asymmetric distribution of data for each individual label?

I'm hoping to limit the number of negative samples, but I just wanted to make sure that this is the correct direction to take.

Here's the code I am using for each classifier. I have a classifier for every label.

    # Hidden layer; `embed` is the looked-up input embedding (defined elsewhere).
    self.w1 = tf.Variable(tf.truncated_normal([embedding_size, hidden_size], -0.1, 0.1), dtype=tf.float32, name="weight1")
    self.b1 = tf.Variable(tf.zeros([hidden_size]), dtype=tf.float32, name="bias1")
    self.o1 = tf.sigmoid(tf.matmul(embed, self.w1) + self.b1)

    # Output layer: two logits (yes/no) per classifier.
    self.w2 = tf.Variable(tf.truncated_normal([hidden_size, 2], -0.1, 0.1), dtype=tf.float32, name="weight2")
    self.b2 = tf.Variable(tf.zeros([2]), dtype=tf.float32, name="bias2")
    self.logits = tf.matmul(self.o1, self.w2) + self.b2
    self.prediction = tf.nn.softmax(self.logits, name="prediction")

    # Mean softmax cross-entropy over the batch, minimized with Adam.
    self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=labels))
    self.optimizer = tf.train.AdamOptimizer(1e-3).minimize(self.loss)

EDIT: After switching to a single multi-label classifier with sigmoid_cross_entropy_with_logits, the predictions still converge to all zeros. I'm posting the code for this version in case it helps:

    # Inputs: token ids, multi-hot label vectors, and a pre-computed embedding matrix.
    self.inp_x = tf.placeholder(shape=[None], dtype=tf.int32, name="inp_x")
    self.labels = tf.placeholder(shape=[None, num_labels], dtype=tf.float32, name="labels")
    self.embeddings = tf.placeholder(shape=[vocabulary_size, embedding_size], dtype=tf.float32, name="embeddings")
    self.embed = tf.nn.embedding_lookup(self.embeddings, self.inp_x)

    # Hidden layer.
    self.w1 = tf.Variable(tf.truncated_normal([embedding_size, hidden_size], -0.1, 0.1), dtype=tf.float32, name="weight1")
    self.b1 = tf.Variable(tf.zeros([hidden_size]), dtype=tf.float32, name="bias1")
    self.o1 = tf.sigmoid(tf.matmul(self.embed, self.w1) + self.b1)

    # Output layer: one independent sigmoid per label.
    self.w2 = tf.Variable(tf.truncated_normal([hidden_size, num_labels], -0.1, 0.1), dtype=tf.float32, name="weight2")
    self.b2 = tf.Variable(tf.zeros([num_labels]), dtype=tf.float32, name="bias2")
    self.logits = tf.matmul(self.o1, self.w2) + self.b2
    self.prediction = tf.sigmoid(self.logits, name="prediction")

    # Mean sigmoid cross-entropy over all labels and examples, minimized with Adam.
    self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.labels))
    self.optimizer = tf.train.AdamOptimizer(1e-3).minimize(self.loss)
Abhishek Patel

1 Answer


Since you have not mentioned the actual data distribution, it is very difficult to guess whether the issue is with your code or with the dataset. However, you can try feeding a set that is uniformly distributed across the classes and checking the result. If the problem is indeed a skewed distribution, you can try the following:

  1. Oversampling the positive (minority) class by duplicating its instances.
  2. Undersampling the majority class.
  3. Using a weighted loss function. TensorFlow has a built-in function called weighted_cross_entropy_with_logits which provides this functionality, albeit only for binary (per-label) classification, and lets you specify the pos_weight you want to assign to the minority class; see the sketch after this list.
  4. Filtering negative instances manually, though this requires some domain knowledge.
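
For example, here is a minimal sketch of option 3 applied to the multi-label model from your edit; the weighted loss is a drop-in replacement for the sigmoid cross-entropy you already have. The pos_weight value is an assumption for illustration: with roughly 250 positives out of ~14000 examples per label, something near 14000/250 ≈ 56 is a plausible starting point, to be tuned from there.

    # Sketch only: replaces the loss of the multi-label model above.
    # pos_weight is illustrative; a tensor of shape [num_labels] also works
    # if the imbalance differs from label to label.
    pos_weight = 56.0  # roughly (#negatives / #positives) per label

    weighted_losses = tf.nn.weighted_cross_entropy_with_logits(
        targets=self.labels,    # multi-hot labels, shape [batch, num_labels]
        logits=self.logits,     # raw outputs, shape [batch, num_labels]
        pos_weight=pos_weight)  # scales the positive-class term of the loss
    self.loss = tf.reduce_mean(weighted_losses)
    self.optimizer = tf.train.AdamOptimizer(1e-3).minimize(self.loss)

This behaves like sigmoid_cross_entropy_with_logits except that errors on positive examples are multiplied by pos_weight, so the trivial "predict everything as zero" solution becomes much more expensive.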
Desh Raj
  • As for the data distribution, I have a dataset of ~14000 entries and 200-300 positive labels for each. – Abhishek Patel Apr 15 '17 at 16:51
  • What is the fraction of your "Other" class relative to the total number of instances? – Desh Raj Apr 15 '17 at 16:53
  • I'm not sure if I understand that question. I feed all the entries to each classifier, so I would say the ratio of positive to negative labels will be 3:140 for each classifier – Abhishek Patel Apr 15 '17 at 16:55
  • Okay, I didn't notice that you had different classifiers for each label. In that case, each of your classifiers is seeing a highly skewed distribution. Why don't you go for a simple multi-label classifier rather than multiple one-vs-all classifiers? Since your classes are more-or-less equally sized, it should give decent performance. – Desh Raj Apr 15 '17 at 16:58
  • I'll give that a try. My concern was that, since the labels are independent, it makes more sense not to have a common layer predicting them, but I'll try the simple multi-label classifier. – Abhishek Patel Apr 15 '17 at 17:01
  • 1
    Weighted cross entropy works, although not to desired level of accuracy but definitely got out of predicting all zeroes. – Abhishek Patel Apr 15 '17 at 23:51