I'm working on a deep learning classification task: distinguishing whether an image/video is boring or interesting. Based on ten thousand labeled images (1. interesting, 2. a little interesting, 3. normal, 4. boring), I fine-tuned several pre-trained ImageNet models (ResNet / Inception / VGG, etc.) on this classification task.
My training error is very small, which means the model has already converged. But the test error is very high: accuracy is only around 35%, close to random guessing.
The difficult parts I found are:
The same object can have different labels. For example, with a dog on grass, a very cute dog may be labeled as an interesting image, while an ugly dog may be labeled as a boring one.
There are many factors that define interesting or boring: image quality, color, the object, the environment... If we just detect images with good quality, or just images with a nice environment, that may be possible, but how can we combine all these factors?
Everyone's interests are different: I may be interested in pets, while someone else may find them boring. Still, there is some common ground that everyone agrees on. But how can I detect it?
Finally, do you think this is a problem that can be solved with deep learning? If so, how would you approach this task?