
Although the slim models documentation states that train_image_classifier.py can be used to train models from scratch, I found it hard in practice. I am trying to train ResNet from scratch on a local machine with 6xK80s, using this command:

DATASET_DIR=/nv/hmart1/ashaban6/scratch/data/imagenet_RF_record
TRAIN_DIR=/nv/hmart1/ashaban6/scratch/train_dir
DEPTH=50
NUM_CLONES=8

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8" python train_image_classifier.py --train_dir=${TRAIN_DIR} --dataset_name=imagenet --model_name=resnet_v1_${DEPTH} --max_number_of_steps=100000000 --batch_size=32 --learning_rate=0.1 --learning_rate_decay_type=exponential --dataset_split_name=train --dataset_dir=${DATASET_DIR} --optimizer=momentum --momentum=0.9 --learning_rate_decay_factor=0.1 --num_epochs_per_decay=30 --weight_decay=0.0001 --num_readers=12 --num_clones=$NUM_CLONES

I followed the settings suggested in the paper. I am using 8 GPUs on a local machine with batch_size 32, so the effective batch size is 32x8=256. The learning rate starts at 0.1 and is decayed by a factor of 10 every 30 epochs. After 70K steps (70000x256/1.2e6 ~ 15 epochs), top-1 accuracy on the validation set is only ~14%, whereas it should be around 50% after that many iterations. I used this command to measure top-1 accuracy:

DATASET_DIR=/nv/hmart1/ashaban6/scratch/data/imagenet_RF_record
CHECKPOINT_FILE=/nv/hmart1/ashaban6/scratch/train_dir/
DEPTH=50

CUDA_VISIBLE_DEVICES="10" python eval_image_classifier.py --alsologtostderr --checkpoint_path=${CHECKPOINT_FILE} --dataset_dir=${DATASET_DIR} --dataset_name=imagenet --dataset_split_name=validation --model_name=resnet_v1_${DEPTH}
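
As a sanity check on the step-to-epoch estimate above, here is a minimal Python sketch of the arithmetic. It assumes that each training step consumes batch_size x num_clones images, and it uses the rounded figure of 1.2e6 training images from my estimate. The function and variable names are made up for illustration and are not part of slim.

```python
# Sanity check of the epoch arithmetic. Assumption: one training step
# processes per_clone_batch * num_clones images in total.

def epochs_completed(steps, per_clone_batch, num_clones, train_images):
    """Return how many passes over the training set `steps` steps cover."""
    effective_batch = per_clone_batch * num_clones
    return steps * effective_batch / train_images


PER_CLONE_BATCH = 32      # --batch_size
NUM_CLONES = 8            # --num_clones
TRAIN_IMAGES = 1.2e6      # rounded ImageNet train set size used in the estimate
EPOCHS_PER_DECAY = 30     # --num_epochs_per_decay

effective_batch = PER_CLONE_BATCH * NUM_CLONES
epochs = epochs_completed(70_000, PER_CLONE_BATCH, NUM_CLONES, TRAIN_IMAGES)

# Number of steps before the first learning rate decay, under the same assumption.
steps_per_decay = int(EPOCHS_PER_DECAY * TRAIN_IMAGES / effective_batch)

print(effective_batch)    # 256
print(round(epochs, 1))   # 14.9
print(steps_per_decay)    # 140625
```

If that assumption holds, 70K steps is still well before the first decay at roughly 140K steps, so the whole run so far has used a learning rate of 0.1.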

With no working examples available, it is hard to say whether there is a bug in the slim training code or a problem in my script. Is anything wrong in my script? Has anyone successfully trained ResNet from scratch?
