I am currently using the following loss function:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
However, my loss quickly approaches zero because there are ~1000 classes and only a handful of positive labels in any example (see attached image), so the network simply learns to predict almost all zeros. I'm worried that this is preventing real learning, even though the loss continues to creep slightly toward zero. Are there any alternative loss functions I should consider?
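To illustrate the problem numerically, here is a small NumPy sketch (the class count and logit value are hypothetical, chosen to mirror the situation described above) showing that a model which collapses to predicting "negative" for every class still achieves a small mean sigmoid cross-entropy when positives are rare:

```python
import numpy as np

def sigmoid_ce(logits, labels):
    # Numerically stable sigmoid cross-entropy, same formula as
    # tf.nn.sigmoid_cross_entropy_with_logits:
    #   max(x, 0) - x*z + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Hypothetical example: 1000 classes, only 5 positives, and a model
# that has collapsed to predicting "no" everywhere (logit = -10).
labels = np.zeros(1000)
labels[:5] = 1.0
logits = np.full(1000, -10.0)

mean_loss = sigmoid_ce(logits, labels).mean()
print(round(mean_loss, 4))  # ~0.05 despite the model missing every positive
```

Each missed positive costs about 10 nats, but the 995 near-zero terms from the easy negatives dilute the mean to roughly 0.05, which is why the averaged loss can look like it is converging while the positives are never learned.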