I am using Theano/Pylearn2 to implement an LSTM model inside my own network. However, I've found that Theano's scan is much, much slower than using plain loops. I ran the Theano profiler and got the following breakdown by class:
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
95.4% 95.4% 25.255s 4.31e-02s Py 586 3 theano.scan_module.scan_op.Scan
1.8% 97.2% 0.466s 4.72e-05s C 9864 41 theano.sandbox.cuda.basic_ops.GpuElemwise
0.8% 97.9% 0.199s 8.75e-05s C 2276 10 theano.sandbox.cuda.basic_ops.GpuAlloc
0.7% 98.7% 0.196s 1.14e-04s C 1724 8 theano.sandbox.cuda.blas.GpuDot22
0.3% 99.0% 0.087s 1.06e-04s C 828 3 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.2% 99.2% 0.051s 1.66e-04s Py 310 2 theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1
and the breakdown by Op:
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
77.2% 77.2% 20.433s 7.40e-02s Py 276 1 forall_inplace,gpu,grad_of_lstm__layers}
18.2% 95.4% 4.822s 1.56e-02s Py 310 2 forall_inplace,gpu,lstm__layers}
So about 95% of the time is spent in Scan, most of it in the gradient of the LSTM layer (this is kind of expected, but I didn't expect it to be this slow).
The main body of my code is
    def fprop(self, state_below, state_prev=None, cell_prev=None):
        # Fall back to the stored state when none is passed in.
        if state_prev is None:
            state_prev = self.state_prev
        if cell_prev is None:
            cell_prev = self.cell_prev
        # Input and forget gates.
        i_gate = T.nnet.sigmoid(T.dot(state_below, self.Wi) +
                                T.dot(state_prev, self.Ui))
        f_gate = T.nnet.sigmoid(T.dot(state_below, self.Wf) +
                                T.dot(state_prev, self.Uf))
        # Candidate cell value, then the updated cell state.
        C = T.tanh(T.dot(state_below, self.Wc) +
                   T.dot(state_prev, self.Uc))
        C = i_gate * C + f_gate * cell_prev
        # Output gate also sees the new cell state through Vo.
        o_gate = T.nnet.sigmoid(T.dot(state_below, self.Wo) +
                                T.dot(state_prev, self.Uo) +
                                T.dot(C, self.Vo))
        h_out = o_gate * T.tanh(C)
        return h_out, C
And I wrote my scan as:
    [h, c, out], _ = theano.scan(
        fn=self.fprop_with_output,
        sequences=[X.T, Y[:, 1:].T],
        outputs_info=[dict(initial=h_, taps=[-1]),
                      dict(initial=c_, taps=[-1]),
                      None],
        n_steps=X.shape[1] - 1)
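To make clear what I mean by "plain loops", here is a minimal NumPy sketch of the same recurrence as fprop above, stepped with an ordinary Python loop over the time axis. It is only an illustration of the computation, not my actual Theano code: the names `lstm_step` and `lstm_loop`, the parameter dictionary, the sizes and the random weights are all made up for this example, and it carries only `h` and `c` (the extra output produced by `fprop_with_output` is left out).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step, mirroring fprop (no bias terms, as in the original)."""
    i_gate = sigmoid(x_t @ p["Wi"] + h_prev @ p["Ui"])
    f_gate = sigmoid(x_t @ p["Wf"] + h_prev @ p["Uf"])
    c_tilde = np.tanh(x_t @ p["Wc"] + h_prev @ p["Uc"])
    c = i_gate * c_tilde + f_gate * c_prev
    # The output gate also sees the new cell state through Vo.
    o_gate = sigmoid(x_t @ p["Wo"] + h_prev @ p["Uo"] + c @ p["Vo"])
    h = o_gate * np.tanh(c)
    return h, c

def lstm_loop(xs, h0, c0, p):
    """Plain Python loop over the leading (time) axis of xs."""
    h, c = h0, c0
    hs, cs = [], []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        hs.append(h)
        cs.append(c)
    return np.stack(hs), np.stack(cs)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_steps, batch, n_in, n_hid = 5, 2, 3, 4
params = {k: 0.1 * rng.standard_normal((n_in, n_hid))
          for k in ("Wi", "Wf", "Wc", "Wo")}
params.update({k: 0.1 * rng.standard_normal((n_hid, n_hid))
               for k in ("Ui", "Uf", "Uc", "Uo", "Vo")})

xs = rng.standard_normal((n_steps, batch, n_in))
h0 = np.zeros((batch, n_hid))
c0 = np.zeros((batch, n_hid))
hs, cs = lstm_loop(xs, h0, c0, params)
print(hs.shape, cs.shape)  # (5, 2, 4) (5, 2, 4)
```

In the scan version, `h` and `c` are the two recurrent outputs with `taps=[-1]`, which corresponds to passing the previous `h` and `c` into each step of the loop here.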
One thing I've noticed is that the profiler reports Scan with type Py, i.e. it seems to use a Python implementation (?). Is that the reason why this is so ridiculously slow, or did I do something wrong? Why does Theano use a Python implementation of Scan instead of a C one?
(I said that using loops is faster, but that is only at runtime; for a large model I didn't manage to compile the loop version within a reasonable amount of time.)