I'm working on a deep learning classification task: distinguishing whether an image/video is boring or interesting. Based on ten thousand labeled images (1. interesting, 2. a little interesting, 3. normal, 4. boring), I fine-tuned several pre-trained ImageNet models (ResNet / Inception / VGG, etc.) on this classification task.
My training error is very small, which means the model has already converged. But the test error is very high: accuracy is only around 35%, close to random guessing.
The difficult parts I found are:
The same object can have different labels. For example, with a dog on grass, a very cute dog may be labeled as an interesting image, while an ugly dog may be labeled as a boring one.
There are many factors that define interesting or boring: image quality, color, the object, the environment... If we just detect images with good quality, or just images with a nice environment, that may be possible, but how can we combine all these factors?
Everyone's interests are different: I may be interested in pets, while someone else may find them boring. Still, there is some common ground that everyone agrees on. But how can I detect it?
Finally, do you think this is a problem that can be solved with deep learning? If so, how would you approach this task?