
I convert my image data to Caffe's DB format (LevelDB or LMDB) using C++; as an example, I use the conversion code for ImageNet.

Does the data need to be shuffled? Can I write all my positives to the DB and then all my negatives, so the labels look like 00000000111111111, or does the data need to be shuffled so the labels look like 010101010110101011010?

How does Caffe sample data from the DB? Is it true that it uses a random subset of all the data with size = batch_size?

Shai
mrgloom

1 Answer


Should you shuffle the samples? Think about the learning process if you don't: at first Caffe sees only samples labeled 0. What do you expect the algorithm to deduce? Simply predict 0 all the time and everything is cool. If there are plenty of 0s before it hits the first 1, Caffe will be very confident in always predicting 0, and it will be very difficult to move the model away from this point.
On the other hand, if it constantly sees a mix of 0s and 1s, it learns meaningful features for separating the examples from the very beginning.
Bottom line: it is very advantageous to shuffle the training samples, especially when using SGD-based approaches.
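
To make the shuffling step concrete, here is a minimal sketch in Python (not the C++ conversion code from the question). The `samples` list and the `make_key` and `shuffled_records` helpers are made up for illustration. The zero-padded index in front of each key is there because LMDB iterates keys in sorted order, so the key decides the order in which the records are read back.

```python
import random

def make_key(index, path):
    # Hypothetical key format: a zero-padded index keeps the sorted key
    # order identical to the shuffled insertion order.
    return "{:08d}_{}".format(index, path)

def shuffled_records(samples, seed=0):
    # Shuffle a copy of the (path, label) pairs, then assign keys
    # in the new order.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [(make_key(i, path), label)
            for i, (path, label) in enumerate(shuffled)]

# Made-up dataset: 8 negatives followed by 8 positives.
samples = [("neg_%d.jpg" % i, 0) for i in range(8)] + \
          [("pos_%d.jpg" % i, 1) for i in range(8)]

records = shuffled_records(samples)

# Labels in the order a key-sorted DB would return them.
labels_in_db_order = [label for _, label in sorted(records)]
print(labels_in_db_order)
```

Each `(key, label)` pair would then be written to the DB together with the image data. If you build the DB with Caffe's `convert_imageset` tool, its `--shuffle` flag should do the same thing for you.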

AFAIK, Caffe does not randomly sample batch_size samples; it goes sequentially over the input DB, one batch of batch_size samples after another.
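
A toy simulation of that sequential reading, as a sketch only: this is plain Python, not Caffe code, and `sequential_batches` and `sorted_db` are made-up names. It assumes the reader keeps a cursor into the DB and wraps back to the start when it reaches the end.

```python
def sequential_batches(labels, batch_size, num_iters):
    # Walk the "DB" in storage order, wrapping to the start at the end.
    cursor = 0
    for _ in range(num_iters):
        batch = []
        for _ in range(batch_size):
            batch.append(labels[cursor])
            cursor = (cursor + 1) % len(labels)
        yield batch

# Unshuffled DB: all negatives first, then all positives.
sorted_db = [0] * 6 + [1] * 6

batches = list(sequential_batches(sorted_db, batch_size=4, num_iters=4))
for batch in batches:
    print(batch)
# [0, 0, 0, 0]
# [0, 0, 1, 1]
# [1, 1, 1, 1]
# [0, 0, 0, 0]   <- wrapped around to the start of the DB
```

With an unshuffled DB most batches contain a single class, which is exactly the situation described above.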

TL;DR
shuffle.

Shai
  • Also found this: https://github.com/BVLC/caffe/issues/1087 `the reason things are sequentially read is for performance purpose - random access on conventional HDDs is near disaster.` Another question: what happens when batch_size*number_iters > number_samples? Does it just start reading from the beginning of the DB again? – mrgloom Jun 06 '16 at 14:02
  • @mrgloom Caffe cycles through the data, over and over again, until it reaches `number_iter`. – Shai Jun 06 '16 at 14:04