
I have two 60 x 80921 matrices, one filled with data and one with reference labels.
I would like to store the values as key/value pairs in two separate LMDBs, one for training (say I'll slice around the 60,000 column mark) and one for testing. Here is my idea; does it work?

X_train = X[:,:60000]
Y_train = Y[:,:60000]
X_test = X[:,60000:]
Y_test = Y[:,60000:]

X_train = X_train.astype(int)
X_test = X_test.astype(int)
Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)

map_size = X_train.nbytes * 10
env = lmdb.open('sensormatrix_train_lmdb', map_size=map_size)
with env.begin(write=True) as txn:  
    for i in range(60):
        for j in range(60000):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.height = X_train.shape[0]
            datum.width = X_train.shape[1]
            datum.data = X_train[i,j].tobytes()
            datum.label= int(Y[i,j])
            str_id= '{:08}'.format(i)

I'm really not sure about this code. Also, what does `format(i)` in the last line refer to?

  • why don't you use `"HDF5Data"` input layers? you have `h5py` package to store numpy arrays as hdf5 data files. See an example [here](http://stackoverflow.com/a/34261942/1714410) (the example uses matlab to write the data, but it is even simpler in python using `h5py`). – Shai Apr 06 '16 at 11:49
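
For context, a minimal `h5py` sketch of what that comment suggests; the file names `train.h5` / `train_h5_list.txt`, the dataset names `x` / `y`, and the random placeholder arrays are illustrative assumptions, not from the original post:

import h5py
import numpy as np

# Hypothetical stand-ins for the real 60 x 60000 training slices.
X_train = np.random.rand(60, 60000).astype(np.float32)
Y_train = np.random.rand(60, 60000).astype(np.float32)

# Caffe's "HDF5Data" layer expects float data with one sample per row,
# so store the transposed (60000 x 60) arrays.
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('x', data=X_train.T)
    f.create_dataset('y', data=Y_train.T)

# The layer's hdf5_data_param { source: ... } points to a text file
# listing the .h5 files to read.
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')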

1 Answer


It's not 100% clear what you are trying to do: are you treating each entry as a separate data sample, or are you trying to train on 60K 1D vectors of dim 60?

Assuming you have 60K training samples of dim 60, you can write the training lmdbs like this:

env_x = lmdb.open('sensormatrix_train_x_lmdb', map_size=map_size) # consider setting map_size a little larger than strictly needed
env_y = lmdb.open('sensormatrix_train_y_lmdb', map_size=map_size)
with env_x.begin(write=True) as txn_x, env_y.begin(write=True) as txn_y:
    for i in xrange(X_train.shape[1]):
        x = X_train[:,i]
        y = Y_train[:,i] 

        datum_x = caffe.io.array_to_datum(arr=x.reshape((60,1,1)),label=i)
        datum_y = caffe.io.array_to_datum(arr=y.reshape((60,1,1)),label=i)
        keystr = '{:0>10d}'.format(i) # format an lmdb key for this entry
        txn_x.put( keystr, datum_x.SerializeToString() ) # actual write to lmdb
        txn_y.put( keystr, datum_y.SerializeToString() )
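
To sanity-check what was written, a minimal read-back sketch (assuming the same lmdb path and key format as above, and Python 2 as in the snippet) might look like this:

import lmdb
import caffe

env = lmdb.open('sensormatrix_train_x_lmdb', readonly=True)
with env.begin() as txn:
    raw = txn.get('{:0>10d}'.format(0))   # key of the first sample
    datum = caffe.proto.caffe_pb2.Datum()
    datum.ParseFromString(raw)
    x0 = caffe.io.datum_to_array(datum)   # numpy array of shape (60, 1, 1)
    print x0.shape, datum.label           # Python 2 print, matching xrange above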

Now you have two LMDBs for training. In your `'prototxt'` you should have two corresponding `"Data"` layers:

layer {
  name: "input_x"
  top: "x"
  top: "idx_x"
  type: "Data"
  data_param { source: "sensormatrix_train_x_lmdb" batch_size: 32 }
  include { phase: TRAIN }
}
layer {
  name: "input_y"
  top: "y"
  top: "idx_y"
  type: "Data"
  data_param { source: "sensormatrix_train_y_lmdb" batch_size: 32 }
  include { phase: TRAIN }
}

To make sure you read corresponding x/y pairs, you can add a sanity check:

layer {
  name: "sanity"
  type: "EuclideanLoss"
  bottom: "idx_x"
  bottom: "idx_y"
  top: "sanity"
  loss_weight: 0 
  propagate_down: false
  propagate_down: false
}
  • I think I understand what you said, like 60% of it. Okay, I am indeed trying to train on 60k vectors (and then use the remaining 21k vectors for 'testing'). Now um, I'm very confused about the details of what you said (this is my first conv net using caffe, woohoo!). Let me work out exactly how to phrase my questions and I'll get back to you. Thanks so much. But I suppose the main question is: how exactly are we getting our X vectors to match up with our Y vectors? – Christopher Turnbull Apr 06 '16 at 12:31
  • @ChristopherTurnbull please look into `"HDF5Data"` - I think it will suit you better in this case. – Shai Apr 06 '16 at 12:32
  • Thanks Shai, I will do. – Christopher Turnbull Apr 06 '16 at 12:34
  • Also, could you explain this error? ---> 12 env_x.put( keystr, datum_x.SerializeToString() ) # actual AttributeError: 'Environment' object has no attribute 'put' – Christopher Turnbull Apr 06 '16 at 12:47
  • @ChristopherTurnbull my bad. please see the corrected answer. – Shai Apr 06 '16 at 12:51
  • Thanks Shai. One more question: if I'm implementing softmax regression at the end, what does the layer definition look like? I'm assuming I need to define 2 layers? – Christopher Turnbull Apr 06 '16 at 13:05
  • Shai, I'm not sure I understand your answer. I'm trying to train the X vectors to match the Ys... with two LMDBs, how do I do this? – Christopher Turnbull Apr 07 '16 at 12:00
  • @ChristopherTurnbull as I wrote in the answer you'll have two `'Data'` layers in your model: one for `x` and one for `y`. The `"sanity"` layer makes sure that at each iteration the corresponding `x` and `y` are used. – Shai Apr 07 '16 at 12:28