Caffe has been crashing during an experiment I am running. The experiment involves training networks on different subsets of the same data using the AlexNet model. For each trial, I generate an LMDB for that particular subset of the data and then modify my network .prototxt to match the parameters. For 40+ trials, I have had no issues.

One particular trial, however, consistently crashes after 227 training iterations. The only error given is "Bus error (core dumped)". This happens regardless of whether I train on the GPU or the CPU. Searching has not turned up anyone else reporting this error. As far as I can tell, it is some sort of memory addressing error. I am using an Nvidia DIGITS box with 64GB of RAM and 12GB of VRAM, and the system monitor shows that I am using nowhere near the system's full memory.

I can provide my prototxt if it might be helpful. However, the dataset is too large to upload (>20GB). The last lines of output before the crash are:
I1128 12:50:01.558748 20000 solver.cpp:228] Iteration 227, loss = 5.8273
I1128 12:50:01.558786 20000 solver.cpp:244] Train net output #0: loss = 5.8273 (* 1 = 5.8273 loss)
I1128 12:50:01.558796 20000 sgd_solver.cpp:106] Iteration 227, lr = 0.001
Bus error (core dumped)
According to this question, bus errors should not occur on modern Intel machines, which is what I am using. What could be causing this problem?
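One check that might help narrow this down, since only this trial's LMDB is affected: LMDB memory-maps its data file, and my understanding is that a failed read of a memory-mapped page (for example from a truncated file or a disk I/O error) can surface as a bus error rather than an ordinary exception. The sketch below is only a rough diagnostic under that assumption. It reads the file sequentially with plain reads, so a bad region should show up as an `OSError` at a known offset. The function names `scan_file` and `check_lmdb_file` are made up for illustration, and a clean scan would not rule out other causes.

```python
import os


def scan_file(path, chunk_size=64 * 1024 * 1024):
    """Read `path` from start to end with ordinary (non-mmap) reads.

    Returns (bytes_read, error). `error` is None if every chunk was
    readable, otherwise the OSError raised at the first failing read.
    """
    bytes_read = 0
    try:
        # buffering=0 gives raw reads, so bytes_read tracks the file offset.
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                bytes_read += len(chunk)
    except OSError as exc:
        return bytes_read, exc
    return bytes_read, None


def check_lmdb_file(path):
    """Compare the size reported by the filesystem with what can be read.

    Returns (ok, expected_bytes, bytes_read, error).
    """
    expected = os.path.getsize(path)
    bytes_read, error = scan_file(path)
    ok = error is None and bytes_read == expected
    return ok, expected, bytes_read, error
```

It would be called on the `data.mdb` file inside the LMDB directory for the failing trial, for example `check_lmdb_file("/path/to/train_lmdb/data.mdb")` (a placeholder path). If `ok` comes back `False`, the value of `bytes_read` gives the approximate offset where reading failed.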